====== ScaLAPACK example1 ======
Chapter two of the ScaLAPACK user guide has a section [[http://www.netlib.org/scalapack/slug/node26.html#SECTION04210000000000000000|How to Run an Example Program Using MPI]]. It lists 7 steps to download, compile and run an example program, ''example1.f''. This page replaces those 7 steps on Mills; several of them can be eliminated, since the needed libraries are already installed on Mills and managed by VALET.
Steps:
- **get** - Download ''example1.f''.
- **set** - Choose the compiler/library combination for your test.
- **compile** - Compile and link the source file.
- **submit** - Copy an openmpi template file and submit the job to run on a compute node.
===== Get =====
The ScaLAPACK user guide example is fully contained in one Fortran file, ''example1.f''. We need to know how many MPI processes to assign when we run the example. The example subdivides the 9x9 matrix over a 2 x 3 process grid; these values are set in a Fortran DATA statement in the program.
The following bash source file will set four variables, download the source file (if needed), and print information to confirm that the ''NPROW'' and ''NPCOL'' Fortran variables match the ''NPROC_ROWS'' and ''NPROC_COLS'' shell variables.
base="example1"
file="example1.f"
let NPROC_ROWS=2
let NPROC_COLS=3
if [ ! -f $file ]; then
wget http://www.netlib.org/scalapack/examples/$file
fi
grep DATA $file
echo "NPROW = $NPROC_ROWS NPCOL = $NPROC_COLS"
This file must be sourced, so the variable values will be available in the next step.
[dnairn@mills example1]$ . get
DATA NPROW / 2 / , NPCOL / 3 /
NPROW = 2 NPCOL = 3
===== Set =====
There are several components in the ScaLAPACK library:
- **BLACS** - Basic Linear Algebra Communication Subprograms
- **PBLAS** - Parallel BLAS
- **LAPACK** - Linear Algebra PACKage
- **BLAS** - Basic Linear Algebra Subprograms
BLACS and PBLAS use a distributed memory model of computing and must run in batch using MPI. Here we use ''openMPI''.
LAPACK is single threaded, but can get performance gains through optimized versions of BLAS. On Mills there are two ways to boost BLAS performance:
- **FMA4** - use the Fused Multiply and Add hardware vector instruction,
- **openMP** - use shared memory and multiple threads on multiple cores.
There is no one best way for all programs. We will show how you can link in alternate versions of BLAS.
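For example, to confirm that the processor on a node supports the FMA4 instruction set, check the CPU flags (standard Linux ''/proc/cpuinfo''; run this on a compute node to be sure):
# Prints "fma4" if the current node's processor supports the instruction set.
grep -m1 '^flags' /proc/cpuinfo | tr ' ' '\n' | grep -x fma4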
There are three VALET versions of ScaLAPACK, each built with a different compiler. They all contain the ScaLAPACK components, but you need to know the proper loader and compiler flags to get all of them, and sometimes the order of the flags matters. You can load a different version of any of these components by changing the loader flags.
In addition to the VALET packages, the Intel Compiler Suite comes with ScaLAPACK as part of the MKL. A simple compiler flag will give you LAPACK and BLAS, but you still need to specify the loader flags to get the proper versions of BLACS and PBLAS.
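If you are not sure which ScaLAPACK builds VALET provides, you can list them before choosing:
# List the ScaLAPACK versions managed by VALET on Mills:
vpkg_versions scalapack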
==== Set for gcc ====
name=gcc
packages=scalapack
libs="-lscalapack -llapack -lblas"
f90flags='-O3'
==== Set for gcc using atlas ====
The order of the packages is important, since the atlas library directory has an incomplete version of LAPACK. The LAPACK in ScaLAPACK needs to come first.
name=gcc_atlas
packages='scalapack atlas'
libs="-lscalapack -llapack -lcblas -lf77blas -latlas"
f90flags='-O3'
==== Set for gcc using ACML ====
name=gcc_acml
packages='acml/5.1.0-gcc-fma4 scalapack'
libs="-lscalapack -lacml"
f90flags='-O3'
==== Set for intel using ACML ====
name=intel_acml
packages=scalapack/2.0.2-openmpi-intel64_acml5
libs="-lscalapack -lacml"
f90flags="-O3"
==== Set for intel using MKL ====
name=intel_mkl
packages=openmpi/1.4.4-intel64
libs="-mkl -lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
f90flags="-O3"
==== Set for PGI using ACML ====
name=pgi_acml
packages=scalapack/2.0.1-openmpi-pgi
libs="-lscalapack -lacml"
f90flags='-O3'
===== Compile =====
This source file uses the shell variables set in the previous steps and loads the VALET packages before the compile command. The first ''mpif90'' command, with ''-show'', only echoes the flags that will be used for the compile.
# $packages set to VALET packages
# $libs set to libraries
# $f90flags set to compiler flags
# $file set to source file name
# $base set to source file name base
vpkg_rollback all
vpkg_devrequire $packages
mpif90 $f90flags -show $file $LDFLAGS $libs
mpif90 $f90flags -o $base $file $LDFLAGS $libs
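After a successful compile you can optionally check which shared libraries the executable picked up. The output depends on the library set chosen above, and statically linked components will not appear:
# Optional check: the executable and any dynamically linked math libraries.
ls -l $base
ldd $base | grep -iE 'scalapack|lapack|blas|acml|mkl'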
===== Submit =====
# $name set to name of job
# $packages set to VALET packages
# $NPROC_ROWS number of row processors
# $NPROC_COLS number of column processors
# $base set to base file name
let NPROC=$NPROC_ROWS*$NPROC_COLS
if [ ! -f "template.qs" ]; then
sed "s/\$MY_EXE/.\/$base/g;"'s/^echo "\(.*\)--"/date "+\1 %D %T--"/' \
/opt/shared/templates/openmpi/openmpi-ibverb.qs > template.qs
echo "new copy of template in template.qs"
fi
sed "s/NPROC/$NPROC/g;s#^vpkg_require.*#vpkg_require $packages#" template.qs > $base.qs
qsub $@ -N tst$name -l standby=1 -l h_rt=04:00:00 $base.qs
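After the job is submitted you can monitor it with the standard Grid Engine command:
# Show your pending and running jobs:
qstat -u $USER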
===== Source file =====
To submit all the jobs, waiting for each to complete before submitting the next:
date "+begin ($s) %c"
. get
. set-gcc;. compile&&. submit -sync y
. set-gcc_atlas;. compile&&. submit -sync y
. set-pgi_acml;. compile&&. submit -sync y
. set-intel_mkl;. compile&&. submit -sync y
. set-gcc_acml;. compile&&. submit -sync y
. set-intel_acml;. compile&&. submit -sync y
date "+end (%s) %c"
====== ScaLAPACK example2 ======
The appendix of the ScaLAPACK user guide, [[http://www.netlib.org/scalapack/slug/node180.html#SECTION041010000000000000000|Example Program #2]], has a more flexible version of [[:clusters:mills:scalapack-example1|example 1]] from Chapter 2.
Example 2 is distributed as a compressed tar file. The expanded directory contains a ''Makefile'', four Fortran files and three data files. The ''Makefile'' needs some variables:
| TESTINGdir | Directory that will contain the executable and data files |
| F77LOADER | Command to load object files with the ScaLAPACK libraries |
| F77LOADFLAGS | Flags for the loader to find libraries, such as -L |
| LIBS | All the required libraries, such as -l |
| F77 | Fortran compiler |
| F77FLAGS | Fortran compiler flags |
| CC | C compiler |
| CCFLAGS | C compiler flags |
| CDEFS | Other C flags, such as -D or -I |
None of these variables is set in the ''Makefile'' itself, but its very first line is:
include ../SLmake.inc
This is the place to assign all the above variables to values that work on Mills.
===== Include file =====
Here is a sample make include file that assumes you have chosen the library and openmpi version with VALET.
TESTINGdir = ../tst-gcc
F77LOADER = mpif90
F77LOADFLAGS = $(LDFLAGS)
LIBS = -lscalapack -llapack -lblas
MAKE = /usr/bin/make
F77 = mpif90
F77FLAGS = -fast
CC = mpicc
CDEFS = $(CPPFLAGS)
===== Make session =====
To use the make file with the make include file, use a VALET ''devrequire'' command to set and export the ''LDFLAGS'' and ''CPPFLAGS'' variables. VALET will also extend the execution ''PATH'' variable so ''mpif90'', ''mpif77'' and ''mpicc'' are in your path.
Sample session:
[(it_css:dnairn)@mills scaex]$ vpkg_devrequire scalapack
Adding dependency `openmpi/1.4.4-gcc` to your environment
Adding package `scalapack/2.0.1-openmpi-gcc` to your environment
[(it_css:dnairn)@mills scaex]$ mkdir -p ../tst-gcc
[(it_css:dnairn)@mills scaex]$ make
mpif90 -L/opt/shared/openmpi/1.4.4-gcc/lib -L/opt/shared/scalapack/2.0.1-openmpi-gcc/lib -o ../tst-gcc/xdscaex pdscaex.o pdscaexinfo.o pdlaread.o pdlawrite.o -lscalapack -llapack -lblas
/usr/bin/make ../tst-gcc/SCAEX.dat
make[1]: Entering directory `/lustre/scratch/dnairn/scalapack/example2/scaex'
cp SCAEX.dat ../tst-gcc
make[1]: Leaving directory `/lustre/scratch/dnairn/scalapack/example2/scaex'
/usr/bin/make ../tst-gcc/SCAEXMAT.dat
make[1]: Entering directory `/lustre/scratch/dnairn/scalapack/example2/scaex'
cp SCAEXMAT.dat ../tst-gcc
make[1]: Leaving directory `/lustre/scratch/dnairn/scalapack/example2/scaex'
/usr/bin/make ../tst-gcc/SCAEXRHS.dat
make[1]: Entering directory `/lustre/scratch/dnairn/scalapack/example2/scaex'
cp SCAEXRHS.dat ../tst-gcc
make[1]: Leaving directory `/lustre/scratch/dnairn/scalapack/example2/scaex'
[(it_css:dnairn)@mills scaex]$
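To run the example-2 executable on the compute nodes you can reuse the same openmpi template approach as in example 1. The following is only a sketch: the 6-process (2 x 3) grid is an assumption; check the ''NPROW'' and ''NPCOL'' values in ''SCAEX.dat'' before submitting.
# Sketch: substitute the process count, VALET package and executable name into
# the openmpi template, then submit from the directory that holds xdscaex and
# the .dat files.  Adjust the 6 if your SCAEX.dat uses a different grid.
cd ../tst-gcc
sed 's/\$MY_EXE/.\/xdscaex/g;s/NPROC/6/g;s#^vpkg_require.*#vpkg_require scalapack#' \
    /opt/shared/templates/openmpi/openmpi-ibverb.qs > xdscaex.qs
qsub -N scaex2 -l standby=1 -l h_rt=04:00:00 xdscaex.qs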
====== ScaLAPACK linsolve benchmark ======
===== Fortran 90 source code =====
We base this benchmark on ''linsolve.f90'' from the online tutorial //Running on Mio Workshop//, presented January 25, 2011 at the Colorado School of Mines; see the [[http://inside.mines.edu/mio/tutorial/pdf/libs.pdf|slides]].
We get the program with:
if [ ! -f "linsolve.f90" ]; then
wget http://inside.mines.edu/mio/tutorial/libs/solvers/linsolve.f90
fi
This program reads one line to start the benchmark. The input must contain 5 numbers:
* N: order of linear system
* NPROC_ROWS: number of rows in process grid
* NPROC_COLS: number of columns in process grid
* ROW_BLOCK_SIZE: blocking size for matrix rows
* COL_BLOCK_SIZE: blocking size for matrix columns
''ROW_BLOCK_SIZE * COL_BLOCK_SIZE'' cannot exceed 500*500.
For this benchmark we will set ''N = NPROC_ROWS*ROW_BLOCK_SIZE = NPROC_COLS*COL_BLOCK_SIZE''. For example, to construct the one-line input file ''in.dat'' from N and the maximum block sizes of 500:
let N=3000
let ROW_BLOCK_SIZE=500
let COL_BLOCK_SIZE=500
let NPROC_ROWS=$N/$ROW_BLOCK_SIZE
let NPROC_COLS=$N/$COL_BLOCK_SIZE
echo "$N $NPROC_ROWS $NPROC_ROWS $ROW_BLOCK_SIZE $COL_BLOCK_SIZE" > in.dat
To allow larger blocks you could extend the two MAX parameters in the ''linsolve.f90'' file, for example:
MAX_VECTOR_SIZE from 1000 to 2000
MMAX_MATRIX_SIZE from 250000 to 1000000
To accommodate these larger sizes some of the FORMAT statements should have I4 instead of I2 and I3.
===== Compiling =====
First set the variables
* $packages set to VALET packages
* $libs set to libraries
* $f90flags set to compiler flags
Since this test is completely contained in one Fortran 90 program, you can compile, link and load with a single command.
vpkg_rollback all
vpkg_devrequire $packages
mpif90 $f90flags -o solve linsolve.f90 $LDFLAGS $libs
Some version of the ''mpif90'' wrapper must be added to your environment, either explicitly or implicitly by VALET. VALET will also set ''LDFLAGS'' for your compile statement.
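A quick way to confirm that VALET did its part before compiling:
# LDFLAGS should list the -L paths of the loaded packages, and mpif90 should
# resolve to the wrapper of the requested openmpi build.
echo $LDFLAGS
which mpif90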
===== Grid engine script file =====
You must run the ''solve'' executable on the compute nodes, submitting with Grid Engine. You will need a script, which we copy from ''/opt/shared/templates/openmpi/openmpi-ibverb.qs'' and modify so that:
* ''$MY_EXE'' is replaced by ''./solve < in.dat''
* ''NPROC'' is replaced by ''$NPROC_ROWS*$NPROC_COLS''
* the ''vpkg_require'' line includes the VALET packages for the benchmark.
For example, with the ''packages'', ''NPROC_ROWS'' and ''NPROC_COLS'' variables set as above:
let NPROC=$NPROC_ROWS*$NPROC_COLS
if [ ! -f "template.qs" ]; then
sed -e 's/\$MY_EXE/.\/solve < in.dat/g;s/^echo "\(.*\)--"/date "+\1 %D %T--"/' \
/opt/shared/templates/openmpi/openmpi-ibverb.qs > template.qs
echo "new copy of template in template.qs"
fi
sed "s/NPROC/$NPROC/g;s#^vpkg_require.*#vpkg_require $packages#" template.qs > solve.qs
The file ''solve.qs'' is a batch script that runs the executable ''./solve'', which reads its data from ''in.dat''.
''NPROC'' is set to the number of processors required to handle a ''$NPROC_ROWS x $NPROC_COLS'' process grid.
There is only one executable, ''solve'', one data file, ''in.dat'', and one batch script file, ''solve.qs''. There may be different test versions for alternate compilers, libraries and problem sizes, but you should not change these files until the Grid Engine jobs are done. Use the ''qsub -sync y'' option to wait for a job to complete.
===== Submitting =====
There is only one linear system solve, and it should take just a few seconds. We will use the under-4hr standby queue for testing. The variables ''name'' and ''N'' are set as described above.
qsub -N $name$N -l standby=1 -l h_rt=04:00:00 solve.qs
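If you plan to rebuild ''solve'' or change ''in.dat'' for another test, make ''qsub'' wait until the job is done:
# -sync y blocks until the job completes, so the files can be safely reused:
qsub -sync y -N $name$N -l standby=1 -l h_rt=04:00:00 solve.qs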
===== Tests =====
==== gcc ====
name=gcc
packages=scalapack/2.0.1-openmpi-gcc
libs="-lscalapack -llapack -lblas"
f90flags=''
==== gcc and atlas ====
name=gcc_atlas
packages='scalapack atlas'
libs="-lscalapack -llapack -lcblas -lf77blas -latlas"
f90flags=''
The documentation in ''/opt/shared/atlas/3.6.0-gcc/doc/LibReadme.txt'' gives the loader flags to use to load the ''atlas'' libraries; they are: ''-LLIBDIR -llapack -lcblas -lf77blas -latlas''. The ''-L'' directory is in the ''LDFLAGS'', which is supplied by VALET.
Also from the same documentation:
ATLAS does not provide a full LAPACK library.
This means the order in which the VALET packages are added is important. If we add ''atlas'' before ''scalapack'', the loader will not be able to find some LAPACK routines.
But this may not be optimal:
Just linking in ATLAS's liblapack.a first will not get you the best LAPACK
performance, mainly because LAPACK's untuned ILAENV will be used instead
of ATLAS's tuned one.
With these variables set, and ''packages'' set to
packages='scalapack atlas'
we get ''ld'' errors:
...
/opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER/libf77blas.a(xerbla.o): In function `xerbla_':
xerbla.f:(.text+0x18): undefined reference to `s_wsfe'
...
Explanation: the function ''xerbla_'' in the ''f77blas'' library was compiled with ''g77'', not ''gfortran'', so it needs the support routines in the ''g2c'' library. Knowing the name of this library, we can find a copy:
find /usr -name libg2c.a
find: `/usr/lib64/audit': Permission denied
/usr/lib/gcc/x86_64-redhat-linux/3.4.6/32/libg2c.a
/usr/lib/gcc/x86_64-redhat-linux/3.4.6/libg2c.a
To remove these errors, change ''libs'' to:
libs="-lscalapack -llapack -lcblas -lf77blas -latlas -L/usr/lib/gcc/x86_64-redhat-linux/3.4.6 -lg2c"
New ''ld'' errors:
...
linsolve.f90:(.text+0xd00): undefined reference to `slarnv_'
...
Explanation: The ''atlas'' version of ''lapack'' does not have all of the LAPACK routines.
nm -g /opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER/liblapack.a | grep slarnv_
nm -g /opt/shared/scalapack/2.0.1-openmpi-gcc/lib/liblapack.a | grep slarnv_
U slarnv_
0000000000000000 T slarnv_
U slarnv_
U slarnv_
No output from first ''nm'' command.
You can copy the full atlas directory into your working directory and then follow the directions in ''/opt/shared/atlas/3.6.0-gcc/doc/LibReadme.txt'', section:
**** GETTING A FULL LAPACK LIB ****
We call this library ''myatlas''.
==== gcc and myatlas ====
name=gcc_myatlas
packages='scalapack'
libs="-lscalapack -L./lib -llapack -lcblas -lf77blas -latlas"
f90flags=''
This requires a copy of the ''atlas'' libraries in a ''./lib'' directory in your working directory, with the full LAPACK merged in, plus a copy of the ''g2c'' library from ''/usr/lib''. You need to build this merged copy of the ''atlas'' libraries yourself. Assuming you do not have a ''lib'' directory in your working directory:
cp -a /opt/shared/atlas/3.6.0-gcc/lib/Linux_BULLDOZER lib
ar x lib/liblapack.a
cp /opt/shared/scalapack/2.0.1-openmpi-gcc/lib/liblapack.a lib
ar r lib/liblapack.a *.o
rm *.o
cp /usr/lib/gcc/x86_64-redhat-linux/3.4.6/libg2c.a lib
Now you have a ''./lib'' directory with your own copy of the ''atlas'' libraries and the ''g2c'' library. You no longer need to specify the ''atlas'' VALET package.
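To confirm the merge worked, check that the previously missing routine is now defined in your merged library:
# A " T slarnv_" line means the routine is now present in the merged liblapack.a:
nm -g lib/liblapack.a | grep " T slarnv_"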
==== gcc and myptatlas ====
name=gcc_myptatlas
packages='scalapack'
libs="-lscalapack -L./lib -llapack -lptcblas -lptf77blas -latlas"
f90flags=''
Parallel threads will dynamically use all the cores available at compile time (24), but only if the problem size indicates they will help.
==== pgi and acml ====
name=pgi_acml
packages=scalapack/2.0.1-openmpi-pgi
libs="-lscalapack -lacml"
f90flags=''
==== intel and mkl ====
name=intel_mkl
packages=openmpi/1.4.4-intel64
libs="-lmkl_scalapack_lp64 -lmkl_blacs_openmpi_lp64"
f90flags="-mkl"
===== Results N=4000 =====
==== BLOCK=1000, NPROCS=16 ====
Each test is repeated three times.
^ File name ^ Time ^
| gcc4000.o91943 | Elapsed time = 0.613728D+05 milliseconds |
| gcc4000.o92019 | Elapsed time = 0.862935D+05 milliseconds |
| gcc4000.o92030 | Elapsed time = 0.826695D+05 milliseconds |
| gcc_atlas4000.o91945 | Elapsed time = 0.386161D+04 milliseconds |
| gcc_atlas4000.o92023 | Elapsed time = 0.433195D+04 milliseconds |
| gcc_atlas4000.o92035 | Elapsed time = 0.424980D+04 milliseconds |
| gcc_myatlas4000.o92009 | Elapsed time = 0.448106D+04 milliseconds |
| gcc_myatlas4000.o92026 | Elapsed time = 0.461706D+04 milliseconds |
| gcc_myatlas4000.o92032 | Elapsed time = 0.441593D+04 milliseconds |
| intel_mkl4000.o91611 | Elapsed time = 0.222194D+05 milliseconds |
| intel_mkl4000.o92016 | Elapsed time = 0.215223D+05 milliseconds |
| intel_mkl4000.o92039 | Elapsed time = 0.214088D+05 milliseconds |
| pgi_acml4000.o91466 | Elapsed time = 0.238015D+04 milliseconds |
| pgi_acml4000.o92017 | Elapsed time = 0.261841D+04 milliseconds |
| pgi_acml4000.o92040 | Elapsed time = 0.222939D+04 milliseconds |
==== BLOCK=800, NPROCS=25 ====
Each test is repeated three times.
^ File name ^ Time ^
| gcc4000.o92335 | Elapsed time = 0.638246D+05 milliseconds |
| gcc4000.o92386 | Elapsed time = 0.633060D+05 milliseconds |
| gcc4000.o92412 | Elapsed time = 0.629561D+05 milliseconds |
| gcc_atlas4000.o92336 | Elapsed time = 0.314615D+04 milliseconds |
| gcc_atlas4000.o92389 | Elapsed time = 0.358208D+04 milliseconds |
| gcc_atlas4000.o92413 | Elapsed time = 0.334147D+04 milliseconds |
| gcc_myatlas4000.o92337 | Elapsed time = 0.363176D+04 milliseconds |
| gcc_myatlas4000.o92390 | Elapsed time = 0.306922D+04 milliseconds |
| gcc_myatlas4000.o92414 | Elapsed time = 0.333779D+04 milliseconds |
| intel_mkl4000.o92339 | Elapsed time = 0.433877D+05 milliseconds |
| intel_mkl4000.o92393 | Elapsed time = 0.400862D+05 milliseconds |
| intel_mkl4000.o92417 | Elapsed time = 0.409855D+05 milliseconds |
| pgi_acml4000.o92338 | Elapsed time = 0.234248D+04 milliseconds |
| pgi_acml4000.o92392 | Elapsed time = 0.276856D+04 milliseconds |
| pgi_acml4000.o92415 | Elapsed time = 0.211567D+04 milliseconds |
==== BLOCK=500, NPROCS=64 ====
Each test is repeated three times.
^ File name ^ Time ^
| gcc4000.o92123 | Elapsed time = 0.284893D+05 milliseconds |
| gcc4000.o92144 | Elapsed time = 0.278744D+05 milliseconds |
| gcc4000.o92150 | Elapsed time = 0.289137D+05 milliseconds |
| gcc_atlas4000.o92130 | Elapsed time = 0.296471D+04 milliseconds |
| gcc_atlas4000.o92142 | Elapsed time = 0.264463D+04 milliseconds |
| gcc_atlas4000.o92148 | Elapsed time = 0.269103D+04 milliseconds |
| gcc_myatlas4000.o92133 | Elapsed time = 0.280457D+04 milliseconds |
| gcc_myatlas4000.o92138 | Elapsed time = 0.312135D+04 milliseconds |
| gcc_myatlas4000.o92153 | Elapsed time = 0.286337D+04 milliseconds |
| intel_mkl4000.o92134 | Elapsed time = 0.436288D+05 milliseconds |
| intel_mkl4000.o92140 | Elapsed time = 0.413780D+05 milliseconds |
| intel_mkl4000.o92152 | Elapsed time = 0.401095D+05 milliseconds |
| pgi_acml4000.o92137 | Elapsed time = 0.234475D+04 milliseconds |
| pgi_acml4000.o92145 | Elapsed time = 0.214514D+04 milliseconds |
| pgi_acml4000.o92149 | Elapsed time = 0.293480D+04 milliseconds |
==== BLOCK=250, NPROCS=256 ====
Each test is repeated three times.
^ File name ^ Time ^
| gcc4000.o92164 | Elapsed time = 0.148302D+05 milliseconds |
| gcc4000.o92168 | Elapsed time = 0.144862D+05 milliseconds |
| gcc4000.o92317 | Elapsed time = 0.160144D+05 milliseconds |
| gcc_atlas4000.o92167 | Elapsed time = 0.785104D+04 milliseconds |
| gcc_atlas4000.o92171 | Elapsed time = 0.749285D+04 milliseconds |
| gcc_atlas4000.o92318 | Elapsed time = 0.798376D+04 milliseconds |
| gcc_myatlas4000.o92165 | Elapsed time = 0.797618D+04 milliseconds |
| gcc_myatlas4000.o92222 | Elapsed time = 0.792745D+04 milliseconds |
| gcc_myatlas4000.o92320 | Elapsed time = 0.720193D+04 milliseconds |
| intel_mkl4000.o92162 | Elapsed time = 0.636915D+05 milliseconds |
| intel_mkl4000.o92169 | Elapsed time = 0.733785D+05 milliseconds |
| intel_mkl4000.o92324 | Elapsed time = 0.653791D+05 milliseconds |
| pgi_acml4000.o92161 | Elapsed time = 0.740457D+04 milliseconds |
| pgi_acml4000.o92170 | Elapsed time = 0.733668D+04 milliseconds |
| pgi_acml4000.o92322 | Elapsed time = 0.769606D+04 milliseconds |
===== Summary =====
==== 4000 x 4000 matrix ====
=== Time to solve linear system ===
A randomly generated matrix is solved using ScaLAPACK with different block sizes.
The times are the average elapsed time in seconds, as reported by ''MPI_WTIME'', over three batch job submissions.
^ Test ^ N=4000 ^^^^
^ name ^ np=16 ^ np=25 ^ np=64 ^ np=256 ^
| [[#gcc|gcc]] | 76.779 | 63.362 | 28.426 | 15.110 |
| [[#gcc_and_atlas|gcc_atlas]] | 4.148 | 3.357 | 2.767 | 7.776 |
| [[#gcc_and_myatlas|gcc_myatlas]] | 4.505 | 3.346 | 2.930 | 7.702 |
| [[#intel_and_mkl|intel_mkl]] | 21.717 | 41.486 | 41.705 | 67.483 |
| [[#pgi_and_acml|pgi_acml]] | 2.409 | 2.409 | 2.475 | 7.479 |
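These averages can be recomputed from the batch output files. A sketch using the file-name pattern from the results tables above; the Fortran ''D'' exponent must be converted before ''awk'' can do arithmetic:
# Average the reported times for one test case, converting D-exponents to E
# and milliseconds to seconds (change the file pattern for other cases).
grep -h "Elapsed time" gcc_atlas4000.o* | \
  awk '{ t = $4; sub(/D/, "E", t); sum += t + 0; n++ }
       END { printf "average = %.3f seconds\n", sum / n / 1000 }'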
There is not much difference between ''gcc_atlas'' and ''gcc_myatlas''.
=== Speedup ===
The speedup for ''ATLAS'', ''MKL'' and ''ACML'' are relative to ''GCC'' with no special BLAS libraries.
{{:clusters:mills:speedup.png?500|}}
==== 16000 x 16000 matrix ====
=== Time to solve linear system ===
A randomly generated matrix is solved using ScaLAPACK with different block sizes.
The times are the average elapsed time in seconds, as reported by ''MPI_WTIME'', over two batch job submissions.
^ Test ^ N=16000 ^^^
^ name ^ np=16 ^ np=64 ^ np=256 ^
| [[#gcc|gcc]] | 5627.655 | 1682.308 | 474.087 |
| [[#gcc_and_atlas|gcc_atlas]] | 231.552 | 81.281 | 44.890 |
| [[#gcc_and_myatlas|gcc_myatlas]] | 235.682 | 88.871 | 39.559 |
| [[#gcc_and_myptatlas|gcc_myptatlas]] | 218.815 | 79.013 | 47.388 |
| [[#intel_and_mkl|intel_mkl]] | 132.071 | 174.859 | 224.309 |
| [[#pgi_and_acml|pgi_acml]] | 161.941 | 64.373 | 34.463 |
=== Speedup ===
Speedup for ATLAS, MKL and ACML compared to the reference GCC with no optimized library. ATLAS has two variants: ''myatlas'' has the complete LAPACK merged in, and ''myptatlas'' loads the multi-threaded versions of the libraries.
{{:clusters:mills:speedup16000.png?500|}}
=== Time plot ===
Elapsed time for ATLAS, MKL and ACML.
{{:clusters:mills:plot16000.png?500|}}