===== Laplace examples from XSEDE HPC Workshops =====
The XSEDE HPC workshop took a serial version of the Laplace example and developed three parallel solutions using OpenMP, MPI and OpenACC. This page describes how to compile and run the serial version and the three parallel solutions on Farber with different compilers.
Copy the examples to your home directory on Farber with the following command:
cp -r ~trainf/Exercises .
If benchmarking, consider using ''-l exclusive=1'' with ''qlogin'' to prevent other jobs from running on your node during your interactive development.
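For orientation, the iteration reports and the ''Max error'' line in the listings on this page are consistent with a Jacobi relaxation of the Laplace equation on a square grid that stops once the largest change in a sweep drops below 0.01. The following is a minimal C sketch under that assumption, not the workshop source: the grid size ''N'', the boundary values, and the names ''init_grid'', ''sweep'', ''solve'' and ''temperature'' are all illustrative.

```c
/* Illustrative Jacobi relaxation on a small grid (not the workshop source). */
#define N 16            /* interior points per side; assumed, kept small here */
#define TOLERANCE 0.01  /* assumed stopping threshold for the largest change */

static double t_new[N + 2][N + 2];
static double t_old[N + 2][N + 2];

/* Assumed boundary conditions: 0 on the top and left edges, rising
   linearly to 100 along the right and bottom edges. */
static void init_grid(void)
{
    int i, j;

    for (i = 0; i <= N + 1; i++)
        for (j = 0; j <= N + 1; j++)
            t_old[i][j] = 0.0;

    for (i = 0; i <= N + 1; i++) {
        t_old[i][N + 1] = (100.0 / (N + 1)) * i;  /* right edge */
        t_old[N + 1][i] = (100.0 / (N + 1)) * i;  /* bottom edge */
    }
}

/* One Jacobi sweep: every interior point becomes the average of its four
   neighbours. Returns the largest absolute change and copies the new
   values back into t_old for the next sweep. */
static double sweep(void)
{
    int i, j;
    double change, max_change = 0.0;

    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            t_new[i][j] = 0.25 * (t_old[i + 1][j] + t_old[i - 1][j] +
                                  t_old[i][j + 1] + t_old[i][j - 1]);

    for (i = 1; i <= N; i++) {
        for (j = 1; j <= N; j++) {
            change = t_new[i][j] - t_old[i][j];
            if (change < 0.0)
                change = -change;
            if (max_change < change)
                max_change = change;
            t_old[i][j] = t_new[i][j];
        }
    }
    return max_change;
}

/* Sweep until the largest change is at most TOLERANCE or max_iterations
   sweeps have been made. Returns the number of sweeps performed. */
int solve(int max_iterations, double *final_change)
{
    int iteration = 0;
    double change = 100.0;

    init_grid();
    while (change > TOLERANCE && iteration < max_iterations) {
        change = sweep();
        iteration++;
    }
    *final_change = change;
    return iteration;
}

/* Read back one grid value after solve() has run. */
double temperature(int i, int j)
{
    return t_old[i][j];
}
```

Calling ''solve(4000, &change)'' from a ''main'' function mirrors the ''Maximum iterations'' prompt in the listings: the loop ends either at convergence or at the iteration cap, whichever comes first.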
==== Serial ====
=== GCC Fortran and C ===
Commands to type after login and starting a workgroup shell for GCC Fortran:
qlogin
cd Exercises/Serial/
vpkg_devrequire gcc/4.9
gfortran laplace_serial.f90
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - //real//
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]?
4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.99533103575345194E-003
Total time was 34.876698 seconds.
real 0m39.333s
user 0m34.876s
sys 0m0.002s
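The difference between CPU time and wall clock time in the listing above can be reproduced with two standard timers. This is a sketch of the distinction only; it does not claim to be the timer the example programs use. ''clock()'' is the ISO C processor-time counter and ''gettimeofday()'' is a POSIX wall clock; the helper names ''cpu_seconds'', ''wall_seconds'' and ''busy_work'' are hypothetical.

```c
#include <sys/time.h>  /* gettimeofday (POSIX) */
#include <time.h>      /* clock, CLOCKS_PER_SEC (ISO C) */

/* Processor time consumed by this process so far, in seconds. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}

/* Wall clock time in seconds since the epoch. */
double wall_seconds(void)
{
    struct timeval tv;

    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec / 1.0e6;
}

/* Hypothetical compute-bound workload: a partial harmonic sum.
   The volatile qualifier keeps the compiler from removing the loop. */
double busy_work(long n)
{
    volatile double sum = 0.0;
    long i;

    for (i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}
```

Sampling both timers before and after ''busy_work'' gives two elapsed values. For a compute-bound serial run they are close; time spent waiting, for example at the ''Maximum iterations'' prompt, adds to the wall clock value but not to the CPU time.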
Commands to type after login and starting a workgroup shell for GCC C:
qlogin
cd Exercises/Serial/
vpkg_devrequire gcc/4.9
gcc laplace_serial.c -lm
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - //real//
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]?
4000
---------- Iteration number: 100 ---------------
[995,995]: 63.33 [996,996]: 72.67 [997,997]: 81.40 [998,998]: 88.97 [999,999]: 94.86 [1000,1000]: 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
[995,995]: 97.66 [996,996]: 98.24 [997,997]: 98.75 [998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
Max error at iteration 3372 was 0.009995
Total time was 32.101587 seconds.
real 0m34.803s
user 0m32.123s
sys 0m0.004s
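By default, ''gcc'' and ''gfortran'' do not optimize. Adding the options ''-O3 -ffast-math'' gives run times comparable to those of the Intel compilers. Note that ''-ffast-math'' relaxes strict IEEE floating-point rules, so the last digits of the reported error can differ slightly.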
qlogin
cd Exercises/Serial/
vpkg_devrequire gcc/4.9
gfortran -O3 -ffast-math laplace_serial.f90
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - //real//
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]?
4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.9953310357463465E-003
Total time was 4.87525797 seconds.
real 0m7.105s
user 0m4.871s
sys 0m0.005s
=== Intel Fortran and C ===
Commands to type after login and starting a workgroup shell for Intel Fortran:
qlogin
cd Exercises/Serial/
vpkg_devrequire intel/2016
ifort laplace_serial.f90
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - //real//
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]?
4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.995331035753452E-003
Total time was 6.477016 seconds.
real 0m8.816s
user 0m6.473s
sys 0m0.007s
Commands to type after login and starting a workgroup shell for Intel C:
qlogin
cd Exercises/Serial/
vpkg_devrequire intel/2016
icc laplace_serial.c
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - //real//
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]?
4000
---------- Iteration number: 100 ---------------
[995,995]: 63.33 [996,996]: 72.67 [997,997]: 81.40 [998,998]: 88.97 [999,999]: 94.86 [1000,1000]: 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
[995,995]: 97.66 [996,996]: 98.24 [997,997]: 98.75 [998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
Max error at iteration 3372 was 0.009995
Total time was 17.156667 seconds.
real 0m19.921s
user 0m17.162s
sys 0m0.009s
The Intel compilers optimize by default. Consider the ''-O0'' option (capital letter 'O' followed by the digit zero '0') to compile without optimization, especially while debugging and testing your code for correctness.
==== OpenMP ====
Commands to type after login for GCC Fortran (the first command starts the workgroup shell):
workgroup -g it_css
cd Exercises/OpenMP/Solutions/
qlogin -pe threads 4
export OMP_NUM_THREADS=4
vpkg_devrequire gcc/4.9
gfortran -fopenmp laplace_omp.f90
time ./a.out
exit
The Total time reported by the program is the user CPU time summed over all four threads, which is why it exceeds the wall clock time. The time command gives you the wall clock time - //real//
[(it_css:trainf)@n036 Solutions]$ time ./a.out
Maximum iterations [100-4000]?
4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.9953310357534519E-003
Total time was 46.3349571 seconds.
real 0m16.459s
user 0m46.331s
sys 0m0.009s
Same for C, with the compile statement:
gcc -fopenmp laplace_omp.c -lm
Commands for the Intel Fortran compiler:
workgroup -g it_css
cd Exercises/OpenMP/Solutions/
qlogin -pe threads 4
export OMP_NUM_THREADS=4
vpkg_devrequire intel/2016
ifort -qopenmp laplace_omp.f90
time ./a.out
exit
* The option ''-openmp'' was deprecated in intel/2016 and replaced by ''-qopenmp''.
Same for Intel C, with the compile statement:
icc -qopenmp laplace_omp.c
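As a rough picture of what the OpenMP compile flags enable, the sketch below marks the two loop nests of a grid sweep with OpenMP directives. It is an illustration under assumptions, not the workshop solution: the grid size ''N'', the boundary values, and the names ''set_boundaries'' and ''sweep_omp'' are made up for this page. Compiled without ''-fopenmp'' or ''-qopenmp'', the directives are ignored and the loops run serially with the same result.

```c
/* Illustrative OpenMP version of one grid sweep (not the workshop solution). */
#define N 64  /* assumed interior grid size, kept small for the sketch */

static double t_new[N + 2][N + 2];
static double t_old[N + 2][N + 2];

/* Assumed boundary conditions: 0 on the top and left edges, rising
   linearly to 100 along the right and bottom edges. */
void set_boundaries(void)
{
    int i, j;

    for (i = 0; i <= N + 1; i++)
        for (j = 0; j <= N + 1; j++)
            t_old[i][j] = 0.0;

    for (i = 0; i <= N + 1; i++) {
        t_old[i][N + 1] = (100.0 / (N + 1)) * i;
        t_old[N + 1][i] = (100.0 / (N + 1)) * i;
    }
}

/* One sweep with the rows divided among the threads.
   Returns the largest absolute change found by any thread. */
double sweep_omp(void)
{
    int i, j;
    double change, max_change = 0.0;

    /* Each thread averages the four neighbours for its share of rows.
       The inner index j must be private to each thread. */
    #pragma omp parallel for private(j)
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            t_new[i][j] = 0.25 * (t_old[i + 1][j] + t_old[i - 1][j] +
                                  t_old[i][j + 1] + t_old[i][j - 1]);

    /* Each thread tracks its own maximum; the reduction clause combines
       the per-thread maxima into max_change when the loop ends. */
    #pragma omp parallel for private(j, change) reduction(max:max_change)
    for (i = 1; i <= N; i++) {
        for (j = 1; j <= N; j++) {
            change = t_new[i][j] - t_old[i][j];
            if (change < 0.0)
                change = -change;
            if (max_change < change)
                max_change = change;
            t_old[i][j] = t_new[i][j];
        }
    }
    return max_change;
}
```

The number of threads is taken from ''OMP_NUM_THREADS'', as set in the commands above. Because CPU time accumulates across threads, a four-thread run can report a user time well above //real//, as in the GCC listing earlier in this section.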
==== MPI ====
Commands to type after login and starting a workgroup shell:
qlogin -pe mpi 4
cd Exercises/MPI/Solutions/
vpkg_require openmpi
mpifort laplace_mpi.f90
mpirun -np 4 ./a.out
exit
The Total time reported comes from a high-resolution wall clock timer that is part of the MPI specification.
[(it_css:trainf)@n036 Solutions]$ mpirun -np 4 ./a.out
Maximum iterations [100-4000]?
4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.99533095416893502E-003
Total time was 9.3715754 seconds.
Same for C, with the compile statement:
mpicc laplace_mpi.c -lm
==== OpenACC ====
This must be run on a node with a GPU accelerator card.
Commands to type after login and starting a workgroup shell:
qlogin -l gpu
cd Exercises/OpenACC/Solutions/
vpkg_devrequire pgi/16
pgf90 -acc -ta=tesla laplace_acc.f90
time ./a.out
exit
The Total time reported by the program is the time on the GPU. The time command will give you the wall clock time - //real//
[(it_css:trainf)@n036 Solutions]$ time ./a.out
Maximum iterations [100-4000]?
4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.9953310357534519E-003
Total time was 1.192778 seconds.
real 0m5.095s
user 0m0.891s
sys 0m0.371s
Same for C, with the compile statement:
pgcc -acc -ta=tesla laplace_acc.c
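To illustrate why the time on the GPU can be much smaller than //real//, the sketch below keeps both grids resident on the accelerator for the whole iteration loop, so data moves between host and device only once in each direction. It is a sketch under assumptions and has not been run on a GPU here: the grid size ''N'', the boundary values, and the names ''set_boundaries'' and ''solve_acc'' are illustrative, not taken from the workshop solution. A compiler without OpenACC support ignores the directives and runs the loops serially.

```c
/* Illustrative OpenACC iteration loop (not the workshop solution). */
#define N 64  /* assumed interior grid size, kept small for the sketch */

static double t_new[N + 2][N + 2];
static double t_old[N + 2][N + 2];

/* Assumed boundary conditions: 0 on the top and left edges, rising
   linearly to 100 along the right and bottom edges. */
void set_boundaries(void)
{
    int i, j;

    for (i = 0; i <= N + 1; i++)
        for (j = 0; j <= N + 1; j++)
            t_old[i][j] = 0.0;

    for (i = 0; i <= N + 1; i++) {
        t_old[i][N + 1] = (100.0 / (N + 1)) * i;
        t_old[N + 1][i] = (100.0 / (N + 1)) * i;
    }
}

/* Sweep until the largest change is at most tolerance or max_iterations
   sweeps have been made. Returns the number of sweeps performed. */
int solve_acc(int max_iterations, double tolerance, double *final_change)
{
    int i, j, iteration = 0;
    double change, max_change = 100.0;

    /* Data region: t_old is copied to the device on entry and back on
       exit; t_new exists only on the device. */
    #pragma acc data copy(t_old) create(t_new)
    while (max_change > tolerance && iteration < max_iterations) {
        #pragma acc parallel loop collapse(2)
        for (i = 1; i <= N; i++)
            for (j = 1; j <= N; j++)
                t_new[i][j] = 0.25 * (t_old[i + 1][j] + t_old[i - 1][j] +
                                      t_old[i][j + 1] + t_old[i][j - 1]);

        max_change = 0.0;
        #pragma acc parallel loop collapse(2) private(change) reduction(max:max_change)
        for (i = 1; i <= N; i++) {
            for (j = 1; j <= N; j++) {
                change = t_new[i][j] - t_old[i][j];
                if (change < 0.0)
                    change = -change;
                if (max_change < change)
                    max_change = change;
                t_old[i][j] = t_new[i][j];
            }
        }
        iteration++;
    }
    *final_change = max_change;
    return iteration;
}
```

Without the data region, each compute region would be free to transfer the grids on every sweep, and that traffic would dominate the run time. With it, only the scalar ''max_change'' returns to the host each iteration so the ''while'' condition can be evaluated.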