ScaLAPACK linsolve benchmark

Fortran 90 source code

We base this benchmark on linsolve.f90 from the online tutorial Running on Mio Workshop, presented January 25, 2011 at the Colorado School of Mines (see the slides).

We get the program with

if [ ! -f "linsolve.f90" ]; then
  wget http://inside.mines.edu/mio/tutorial/libs/solvers/linsolve.f90
fi

This program reads one line to start the benchmark. The input must contain 5 numbers:

N NPROC_ROWS NPROC_COLS ROW_BLOCK_SIZE COL_BLOCK_SIZE

where ROW_BLOCK_SIZE * COL_BLOCK_SIZE cannot exceed 500*500.

For this benchmark we will set N = NPROC_ROWS*ROW_BLOCK_SIZE = NPROC_COLS*COL_BLOCK_SIZE. For example, to construct a one-line file, in.dat, from N and the maximum block sizes of 500:

let N=3000
let ROW_BLOCK_SIZE=500
let COL_BLOCK_SIZE=500
let NPROC_ROWS=$N/$ROW_BLOCK_SIZE
let NPROC_COLS=$N/$COL_BLOCK_SIZE
echo "$N $NPROC_ROWS $NPROC_ROWS $ROW_BLOCK_SIZE $COL_BLOCK_SIZE" > in.dat

To allow larger blocks you could extend the two MAX parameters in the linsolve.f90 file. For example:

MAX_VECTOR_SIZE from 1000 to 2000
MAX_MATRIX_SIZE from 250000 to 1000000

To accommodate these larger sizes, some of the FORMAT statements should use I4 instead of I2 or I3.
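
For example, a hypothetical pair of sed edits (assuming each PARAMETER is declared on its own line in linsolve.f90, which you should verify first):

sed -i -e 's/MAX_VECTOR_SIZE *= *1000/MAX_VECTOR_SIZE = 2000/' \
       -e 's/MAX_MATRIX_SIZE *= *250000/MAX_MATRIX_SIZE = 1000000/' linsolve.f90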

Compiling

First set the variables name, packages, libs and f90flags for the test you want to run (see the Tests section below).

Since this test is completely contained in one Fortran 90 program, you can compile, link and load with one command.

vpkg_rollback all
vpkg_devrequire $packages
 
mpif90 $f90flags -o solve linsolve.f90 $LDFLAGS $libs 

Some version of the mpif90 wrapper should be added to your environment, either explicitly or implicitly by VALET. VALET will also set LDFLAGS for your compile statement.
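
For instance, after vpkg_devrequire has added the scalapack package you can inspect LDFLAGS; the value shown here is illustrative:

echo $LDFLAGS
# e.g. -L/opt/shared/scalapack/2.0.1-openmpi-gcc/lib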

Grid Engine script file

You must run the solve executable on the compute nodes by submitting it with Grid Engine. You will need a script, which we copy from /opt/shared/templates/openmpi/openmpi-ibverb.qs and modify to run solve on the right number of processors.

For example, with the packages, NPROC_ROWS and NPROC_COLS variables set as above:

let NPROC=$NPROC_ROWS*$NPROC_COLS
if [ ! -f "template.qs" ]; then
  sed -e 's/\$MY_EXE/.\/solve < in.dat/g;s/^echo "\(.*\)--"/date "+\1 %D %T--"/' \
     /opt/shared/templates/openmpi/openmpi-ibverb.qs > template.qs
  echo "new copy of template in template.qs"
fi
sed "s/NPROC/$NPROC/g;s#^vpkg_require.*#vpkg_require $packages#" template.qs > solve.qs

The file solve.qs will contain a script to run the executable ./solve, which reads its data from in.dat. Also, $NPROC is set to the number of processors required to handle $NPROC_ROWS x $NPROC_COLS blocks.

There is only one executable, solve, one data file, in.dat, and one batch script file, solve.qs. There may be different test versions to try alternate compilers, libraries and problem sizes, but you should not change these files until the Grid Engine jobs are done. Use the qsub -sync y option to wait for all jobs to complete.
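
For example, to block until a job finishes before editing these files (options as in the Submitting section below):

qsub -sync y -N $name$N -l standby=1 -l h_rt=04:00:00 solve.qs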

Submitting

There is only one linear system solve, and it should take just a few seconds. We will use the under-4-hour standby queue for testing. The variables name and N are set as described above.

qsub -N $name$N -l standby=1 -l h_rt=04:00:00 solve.qs
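
Each test in the Results section was repeated three times; a simple sketch for queuing the repeats (each job writes a distinct .o output file):

for i in 1 2 3; do
  qsub -N $name$N -l standby=1 -l h_rt=04:00:00 solve.qs
done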

Tests

gcc

name=gcc
packages=scalapack/2.0.1-openmpi-gcc
libs="-lscalapack -llapack -lblas"
f90flags=''
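
With these settings, the build sequence from the Compiling section expands to:

vpkg_rollback all
vpkg_devrequire scalapack/2.0.1-openmpi-gcc
mpif90 -o solve linsolve.f90 $LDFLAGS -lscalapack -llapack -lblas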

gcc and atlas

name=gcc_atlas
packages='scalapack atlas'
libs="-lscalapack -llapack -lcblas -lf77blas -latlas"
f90flags=''

The documentation in /opt/shared/atlas/3.6.0-gcc/doc/LibReadme.txt gives the loader flags to use to load the atlas libraries; they are: -LLIBDIR -llapack -lcblas -lf77blas -latlas. The -L directory is in the LDFLAGS, which is supplied by VALET.

Also from the same documentation:

ATLAS does not provide a full LAPACK library.

This means the order the VALET packages are added is important. If we add atlas before scalapack, then the loader will not be able to find some LAPACK routines.
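
A sketch of the two orderings, following the rule above:

vpkg_devrequire scalapack atlas   # scalapack's full liblapack.a is searched first: links
vpkg_devrequire atlas scalapack   # atlas's partial liblapack.a hides the full one: undefined references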

But this may not be optimal:

Just linking in ATLAS's liblapack.a first will not get you the best LAPACK
performance, mainly because LAPACK's untuned ILAENV will be used instead
of ATLAS's tuned one.

With these variables set and packages changed to

packages='scalapack atlas'

we get ld errors:

      ...
/opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER/libf77blas.a(xerbla.o): In function `xerbla_':
xerbla.f:(.text+0x18): undefined reference to `s_wsfe'
      ... 

Explanation: the function xerbla_ in the library f77blas was compiled with g77, not gfortran, so it needs the support routines in the g2c library. Knowing the name of this library, we can find a copy:

  find /usr -name libg2c.a
find: `/usr/lib64/audit': Permission denied
/usr/lib/gcc/x86_64-redhat-linux/3.4.6/32/libg2c.a
/usr/lib/gcc/x86_64-redhat-linux/3.4.6/libg2c.a

To remove these errors, change libs to:

libs="-lscalapack -llapack -lcblas -lf77blas -latlas -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6 -lg2c"

New ld errors:

   ...
   linsolve.f90:(.text+0xd00): undefined reference to `slarnv_'
   ...

Explanation: The atlas version of lapack does not have all of the LAPACK routines.

  nm -g /opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER/liblapack.a | grep slarnv_
  nm -g /opt/shared/scalapack/2.0.1-openmpi-gcc/lib/liblapack.a | grep slarnv_
                 U slarnv_
0000000000000000 T slarnv_
                 U slarnv_
                 U slarnv_

There is no output from the first nm command: the atlas liblapack.a does not define slarnv_, while the scalapack liblapack.a does (the line marked T).

You can copy the full atlas directory into your working directory and then follow the directions in /opt/shared/atlas/3.6.0-gcc/doc/LibReadme.txt, section:

**** GETTING A FULL LAPACK LIB ****

We call this library myatlas.

gcc and myatlas

name=gcc_myatlas
packages='scalapack'
libs="-lscalapack -L./lib -llapack -lcblas -lf77blas -latlas"
f90flags=''

This requires a copy of the atlas libraries in ./lib in your working directory, and uses the g2c library copied from the /usr/lib tree. You need to build your own full LAPACK from this copy of atlas.

Assuming you do not have a lib directory in your working directory:

  cp -a /opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER lib       # copy the atlas libraries
  ar x lib/liblapack.a                                            # extract atlas's tuned LAPACK objects
  cp /opt/shared/scalapack/2.0.1-openmpi-gcc/lib/liblapack.a lib  # start from a full LAPACK
  ar r lib/liblapack.a *.o                                        # replace its routines with the tuned objects
  rm *.o
  cp /usr/lib/gcc/x86_64-redhat-linux/3.4.6/libg2c.a lib          # add the g2c support library

Now you have a ./lib directory with your own copy of the atlas libraries and the g2c library. You no longer need to specify the atlas VALET package.
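
To confirm the merge worked, rerun the nm check from above against the new library; slarnv_ should now show up as defined (T):

  nm -g lib/liblapack.a | grep slarnv_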

gcc and myptatlas

name=gcc_myptatlas
packages='scalapack'
libs="-lscalapack -L./lib -llapack -lptcblas -lptf77blas -latlas"
f90flags=''

Parallel threads will dynamically use all the cores available (24 at compile time), but only if the problem size indicates they will help.

pgi and acml

name=pgi_acml
packages=scalapack/2.0.1-openmpi-pgi
libs="-lscalapack -lacml"
f90flags=''

intel and mkl

name=intel_mkl
packages=openmpi/1.4.4-intel64
libs="-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
f90flags="-mkl"

Results N=4000

BLOCK=1000, NPROCS=16

Each test is repeated three times.

File name Time
gcc4000.o91943 Elapsed time = 0.613728D+05 milliseconds
gcc4000.o92019 Elapsed time = 0.862935D+05 milliseconds
gcc4000.o92030 Elapsed time = 0.826695D+05 milliseconds
gcc_atlas4000.o91945 Elapsed time = 0.386161D+04 milliseconds
gcc_atlas4000.o92023 Elapsed time = 0.433195D+04 milliseconds
gcc_atlas4000.o92035 Elapsed time = 0.424980D+04 milliseconds
gcc_myatlas4000.o92009 Elapsed time = 0.448106D+04 milliseconds
gcc_myatlas4000.o92026 Elapsed time = 0.461706D+04 milliseconds
gcc_myatlas4000.o92032 Elapsed time = 0.441593D+04 milliseconds
intel_mkl4000.o91611 Elapsed time = 0.222194D+05 milliseconds
intel_mkl4000.o92016 Elapsed time = 0.215223D+05 milliseconds
intel_mkl4000.o92039 Elapsed time = 0.214088D+05 milliseconds
pgi_acml4000.o91466 Elapsed time = 0.238015D+04 milliseconds
pgi_acml4000.o92017 Elapsed time = 0.261841D+04 milliseconds
pgi_acml4000.o92040 Elapsed time = 0.222939D+04 milliseconds

BLOCK=800, NPROCS=25

Each test is repeated three times.

File name Time
gcc4000.o92335 Elapsed time = 0.638246D+05 milliseconds
gcc4000.o92386 Elapsed time = 0.633060D+05 milliseconds
gcc4000.o92412 Elapsed time = 0.629561D+05 milliseconds
gcc_atlas4000.o92336 Elapsed time = 0.314615D+04 milliseconds
gcc_atlas4000.o92389 Elapsed time = 0.358208D+04 milliseconds
gcc_atlas4000.o92413 Elapsed time = 0.334147D+04 milliseconds
gcc_myatlas4000.o92337 Elapsed time = 0.363176D+04 milliseconds
gcc_myatlas4000.o92390 Elapsed time = 0.306922D+04 milliseconds
gcc_myatlas4000.o92414 Elapsed time = 0.333779D+04 milliseconds
intel_mkl4000.o92339 Elapsed time = 0.433877D+05 milliseconds
intel_mkl4000.o92393 Elapsed time = 0.400862D+05 milliseconds
intel_mkl4000.o92417 Elapsed time = 0.409855D+05 milliseconds
pgi_acml4000.o92338 Elapsed time = 0.234248D+04 milliseconds
pgi_acml4000.o92392 Elapsed time = 0.276856D+04 milliseconds
pgi_acml4000.o92415 Elapsed time = 0.211567D+04 milliseconds

BLOCK=500, NPROCS=64

Each test is repeated three times.

File name Time
gcc4000.o92123 Elapsed time = 0.284893D+05 milliseconds
gcc4000.o92144 Elapsed time = 0.278744D+05 milliseconds
gcc4000.o92150 Elapsed time = 0.289137D+05 milliseconds
gcc_atlas4000.o92130 Elapsed time = 0.296471D+04 milliseconds
gcc_atlas4000.o92142 Elapsed time = 0.264463D+04 milliseconds
gcc_atlas4000.o92148 Elapsed time = 0.269103D+04 milliseconds
gcc_myatlas4000.o92133 Elapsed time = 0.280457D+04 milliseconds
gcc_myatlas4000.o92138 Elapsed time = 0.312135D+04 milliseconds
gcc_myatlas4000.o92153 Elapsed time = 0.286337D+04 milliseconds
intel_mkl4000.o92134 Elapsed time = 0.436288D+05 milliseconds
intel_mkl4000.o92140 Elapsed time = 0.413780D+05 milliseconds
intel_mkl4000.o92152 Elapsed time = 0.401095D+05 milliseconds
pgi_acml4000.o92137 Elapsed time = 0.234475D+04 milliseconds
pgi_acml4000.o92145 Elapsed time = 0.214514D+04 milliseconds
pgi_acml4000.o92149 Elapsed time = 0.293480D+04 milliseconds

BLOCK=250, NPROCS=256

Each test is repeated three times.

File name Time
gcc4000.o92164 Elapsed time = 0.148302D+05 milliseconds
gcc4000.o92168 Elapsed time = 0.144862D+05 milliseconds
gcc4000.o92317 Elapsed time = 0.160144D+05 milliseconds
gcc_atlas4000.o92167 Elapsed time = 0.785104D+04 milliseconds
gcc_atlas4000.o92171 Elapsed time = 0.749285D+04 milliseconds
gcc_atlas4000.o92318 Elapsed time = 0.798376D+04 milliseconds
gcc_myatlas4000.o92165 Elapsed time = 0.797618D+04 milliseconds
gcc_myatlas4000.o92222 Elapsed time = 0.792745D+04 milliseconds
gcc_myatlas4000.o92320 Elapsed time = 0.720193D+04 milliseconds
intel_mkl4000.o92162 Elapsed time = 0.636915D+05 milliseconds
intel_mkl4000.o92169 Elapsed time = 0.733785D+05 milliseconds
intel_mkl4000.o92324 Elapsed time = 0.653791D+05 milliseconds
pgi_acml4000.o92161 Elapsed time = 0.740457D+04 milliseconds
pgi_acml4000.o92170 Elapsed time = 0.733668D+04 milliseconds
pgi_acml4000.o92322 Elapsed time = 0.769606D+04 milliseconds

Summary

4000 x 4000 matrix

Time to solve linear system

A randomly generated matrix is solved using ScaLAPACK with different block sizes. The times are the average elapsed time in seconds, as reported by MPI_WTIME, over three batch job submissions.
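
As a sketch, the average for one test can be computed from its job output files (file names as listed above; the D exponent is converted so awk can read the numbers):

  grep -h 'Elapsed time' gcc4000.o* |
    awk '{ gsub(/D/, "E", $4); sum += $4 } END { printf "%.3f seconds\n", sum/NR/1000 }'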

Test N=4000
name np=16 np=25 np=64 np=256
gcc 76.779 63.362 28.426 15.110
gcc_atlas 4.148 3.357 2.767 7.776
gcc_myatlas 4.505 3.346 2.930 7.702
intel_mkl 21.717 41.486 41.705 67.483
pgi_acml 2.409 2.409 2.475 7.479

There is not much difference between gcc_atlas and gcc_myatlas.

Speedup

The speedups for ATLAS, MKL and ACML are relative to GCC with no special BLAS libraries.
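
For example, at np=64 the ACML speedup over the plain gcc build is 28.426/2.475:

  echo "scale=1; 28.426 / 2.475" | bc   # prints 11.4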

16000 x 16000 matrix

Time to solve linear system

A randomly generated matrix is solved using ScaLAPACK with different block sizes. The times are the average elapsed time in seconds, as reported by MPI_WTIME, over two batch job submissions.

Test N=16000
name np=16 np=64 np=256
gcc 5627.655 1682.308 474.087
gcc_atlas 231.552 81.281 44.890
gcc_myatlas 235.682 88.871 39.559
gcc_myptatlas 218.815 79.013 47.388
intel_mkl 132.071 174.859 224.309
pgi_acml 161.941 64.373 34.463

Speedup

Speedup for ATLAS, MKL and ACML compared to the reference GCC build with no optimized libraries. ATLAS has two variants: myatlas has the complete LAPACK merged in, and myptatlas loads the multi-threaded versions of the libraries.

Time plot

Elapsed time for ATLAS, MKL and ACML.