Laplace examples from XSEDE HPC Workshops
The XSEDE HPC workshop focused on taking a serial version of the Laplace example and developing three parallel solutions for OpenMP, MPI and OpenACC. This documentation describes how to compile and run the serial and three parallel solutions of the Laplace example on Farber using different compilers.
Copy the examples to your home directory on Farber with the following command:
cp -r ~trainf/Exercises .
Use the option -l exclusive=1 with qlogin to prevent other jobs from running on your node during your interactive development.
Serial
GCC Fortran and C
Commands to type after login and starting a workgroup shell for GCC Fortran:
qlogin
cd Exercises/Serial/
vpkg_devrequire gcc/4.9
gfortran laplace_serial.f90
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - real
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]? 4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.99533103575345194E-003
Total time was 34.876698 seconds.
real 0m39.333s
user 0m34.876s
sys 0m0.002s
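The gap between the program's Total time and the real line from time comes from measuring two different clocks. The following C sketch illustrates the distinction; it is not taken from the workshop sources. The function names are hypothetical, burn() is only a stand-in for the solver, and gettimeofday() assumes a POSIX system such as Linux.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>
#include <time.h>

/* CPU time consumed by this process so far, in seconds (standard C clock()). */
double cpu_seconds(void)
{
    return (double)clock() / (double)CLOCKS_PER_SEC;
}

/* Wall clock time in seconds since the epoch (POSIX gettimeofday()). */
double wall_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc; /* keeps the build quiet when assertions are disabled */
    return (double)tv.tv_sec + (double)tv.tv_usec * 1e-6;
}

/* Hypothetical stand-in for the solver: returns 1/1 + 1/2 + ... + 1/n. */
double burn(long n)
{
    double sum = 0.0;
    long i;
    for (i = 1; i <= n; i++)
        sum += 1.0 / (double)i;
    return sum;
}

/* CPU seconds spent inside burn(n), measured the way a program can time itself. */
double cpu_seconds_for(long n)
{
    double start = cpu_seconds();
    volatile double sink = burn(n); /* volatile keeps the call from being optimized away */
    (void)sink;
    return cpu_seconds() - start;
}
```

Taking wall_seconds() before and after the same work gives the elapsed time instead. The two numbers drift apart whenever the process waits, for example on the interactive prompt for the iteration count.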
Commands to type after login and starting a workgroup shell for GCC C:
qlogin
cd Exercises/Serial/
vpkg_devrequire gcc/4.9
gcc laplace_serial.c -lm
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - real
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]? 4000
---------- Iteration number: 100 ---------------
[995,995]: 63.33 [996,996]: 72.67 [997,997]: 81.40 [998,998]: 88.97 [999,999]: 94.86 [1000,1000]: 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
[995,995]: 97.66 [996,996]: 98.24 [997,997]: 98.75 [998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
Max error at iteration 3372 was 0.009995
Total time was 32.101587 seconds.
real 0m34.803s
user 0m32.123s
sys 0m0.004s
By default, gcc does not do any optimization. Adding the options -O3 -ffast-math will produce results similar to those of the Intel compilers.
qlogin
cd Exercises/Serial/
vpkg_devrequire gcc/4.9
gfortran -O3 -ffast-math laplace_serial.f90
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - real
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]? 4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.9953310357463465E-003
Total time was 4.87525797 seconds.
real 0m7.105s
user 0m4.871s
sys 0m0.005s
Intel Fortran and C
Commands to type after login and starting a workgroup shell for Intel Fortran:
qlogin
cd Exercises/Serial/
vpkg_devrequire intel/2016
ifort laplace_serial.f90
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - real
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]? 4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.995331035753452E-003
Total time was 6.477016 seconds.
real 0m8.816s
user 0m6.473s
sys 0m0.007s
Commands to type after login and starting a workgroup shell for Intel C:
qlogin
cd Exercises/Serial/
vpkg_devrequire intel/2016
icc laplace_serial.c
time ./a.out
exit
The Total time reported by the program is the user CPU time. The time command will also give you the wall clock time - real
[(it_css:trainf)@n038 Serial]$ time ./a.out
Maximum iterations [100-4000]? 4000
---------- Iteration number: 100 ---------------
[995,995]: 63.33 [996,996]: 72.67 [997,997]: 81.40 [998,998]: 88.97 [999,999]: 94.86 [1000,1000]: 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
[995,995]: 97.66 [996,996]: 98.24 [997,997]: 98.75 [998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
Max error at iteration 3372 was 0.009995
Total time was 17.156667 seconds.
real 0m19.921s
user 0m17.162s
sys 0m0.009s
Use the -O0 option (capital letter 'O' followed by the number zero '0') with the Intel compilers to compile with no optimizations, especially when you are debugging and testing your code for correctness.
OpenMP
Commands to type after login and starting a workgroup shell:
workgroup -g it_css
cd Exercises/OpenMP/Solutions/
qlogin -pe threads 4
export OMP_NUM_THREADS=4
vpkg_devrequire gcc/4.9
gfortran -fopenmp laplace_omp.f90
time ./a.out
exit
The Total time reported by the program is the user CPU time summed over all four threads, which is why it exceeds the wall clock time. The time command will also give you the wall clock time - real
[(it_css:trainf)@n036 Solutions]$ time ./a.out
Maximum iterations [100-4000]? 4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.9953310357534519E-003
Total time was 46.3349571 seconds.
real 0m16.459s
user 0m46.331s
sys 0m0.009s
Same for C, with the compile statement:
gcc -fopenmp laplace_omp.c -lm
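To make the effect of -fopenmp concrete, here is a minimal C sketch of an OpenMP-parallel Jacobi iteration for the Laplace problem. It is not the workshop's laplace_omp.c. The grid size N, the boundary values, and the names init_grid, sweep and solve are assumptions chosen for illustration. The 0.01 tolerance is inferred from the reported maximum error just under 0.01.

```c
#include <assert.h>

#define N 64 /* hypothetical interior size, far smaller than the workshop grid */

double t_old[N + 2][N + 2];
double t_new[N + 2][N + 2];

/* Assumed boundary values: 0 on the top and left edges,
   rising linearly to 100 along the right and bottom edges. */
void init_grid(void)
{
    int i, j;
    for (i = 0; i <= N + 1; i++)
        for (j = 0; j <= N + 1; j++)
            t_old[i][j] = 0.0;
    for (i = 0; i <= N + 1; i++) {
        t_old[i][N + 1] = 100.0 * i / (N + 1); /* right edge */
        t_old[N + 1][i] = 100.0 * i / (N + 1); /* bottom edge */
    }
}

/* One Jacobi sweep; returns the largest change of any interior point. */
double sweep(void)
{
    int i, j;
    double dt = 0.0;

    /* Each thread averages the four neighbours for its share of the rows. */
    #pragma omp parallel for private(j)
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++)
            t_new[i][j] = 0.25 * (t_old[i + 1][j] + t_old[i - 1][j] +
                                  t_old[i][j + 1] + t_old[i][j - 1]);

    /* The max reduction merges the per-thread maxima into dt. */
    #pragma omp parallel for private(j) reduction(max:dt)
    for (i = 1; i <= N; i++)
        for (j = 1; j <= N; j++) {
            double change = t_new[i][j] - t_old[i][j];
            if (change < 0.0)
                change = -change;
            if (change > dt)
                dt = change;
            t_old[i][j] = t_new[i][j];
        }
    return dt;
}

/* Iterate until the largest change drops to tol or max_iter is reached. */
int solve(int max_iter, double tol)
{
    int iter = 0;
    double dt = tol + 1.0;
    assert(max_iter >= 1 && tol > 0.0);
    init_grid();
    while (dt > tol && iter < max_iter) {
        dt = sweep();
        iter++;
    }
    return iter;
}
```

Calling solve(4000, 0.01) mirrors the prompt and stopping rule seen in the runs above. Compiled without -fopenmp, the pragmas are ignored and the same code runs serially. The sketch does not use the math library, so it does not need -lm.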
Commands for Intel Fortran:
workgroup -g it_css
cd Exercises/OpenMP/Solutions/
qlogin -pe threads 4
export OMP_NUM_THREADS=4
vpkg_devrequire intel/2016
ifort -qopenmp laplace_omp.f90
time ./a.out
exit
- The option -openmp was deprecated in intel/2016 and replaced by -qopenmp.
Same for Intel C, with the compile statement:
icc -qopenmp laplace_omp.c
MPI
Commands to type after login and starting a workgroup shell:
qlogin -pe mpi 4
cd Exercises/MPI/Solutions/
vpkg_require openmpi
mpifort laplace_mpi.f90
mpirun -np 4 ./a.out
exit
The Total time reported is from a high-resolution wall clock timer, which is part of the MPI specification.
[(it_css:trainf)@n036 Solutions]$ mpirun -np 4 ./a.out
Maximum iterations [100-4000]? 4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.99533095416893502E-003
Total time was 9.3715754 seconds.
Same for C, with the compile statement:
mpicc laplace_mpi.c -lm
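This page does not describe how the MPI solution divides the grid among the four processes. One common approach, assumed here purely for illustration, is to give each process a contiguous band of rows. The C sketch below computes such a split without calling MPI, so it can be tried on its own. The function names are hypothetical. In a real MPI program the rank and the process count would come from MPI_Comm_rank and MPI_Comm_size.

```c
#include <assert.h>

/* Number of grid rows owned by `rank` when total_rows are split over nranks.
   The first (total_rows % nranks) ranks each take one extra row. */
int rows_on_rank(int total_rows, int nranks, int rank)
{
    assert(total_rows >= 0 && nranks > 0 && rank >= 0 && rank < nranks);
    return total_rows / nranks + (rank < total_rows % nranks ? 1 : 0);
}

/* Zero-based index of the first row owned by `rank` under the same split. */
int first_row_on_rank(int total_rows, int nranks, int rank)
{
    int base, extra;
    assert(total_rows >= 0 && nranks > 0 && rank >= 0 && rank < nranks);
    base = total_rows / nranks;
    extra = total_rows % nranks;
    /* Every lower rank holds `base` rows; the first `extra` ranks hold one more. */
    return rank * base + (rank < extra ? rank : extra);
}
```

With 1000 rows and 4 processes, each process would own 250 rows under this split. Neighbouring processes would then exchange their edge rows every iteration.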
OpenACC
This must be run on a node with a GPU accelerator card.
Commands to type after login and starting a workgroup shell:
qlogin -l gpu
cd Exercises/OpenACC/Solutions/
vpkg_devrequire pgi/16
pgf90 -acc -ta=tesla laplace_acc.f90
time ./a.out
exit
The Total time reported by the program is the time on the GPU. The time command will give you the wall clock time - real
[(it_css:trainf)@n036 Solutions]$ time ./a.out
Maximum iterations [100-4000]? 4000
---------- Iteration number: 100 ---------------
( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
-------- more iteration progress reports --------
---------- Iteration number: 3300 ---------------
( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
Max error at iteration 3372 was 9.9953310357534519E-003
Total time was 1.192778 seconds.
real 0m5.095s
user 0m0.891s
sys 0m0.371s
Same for C, with the compile statement:
pgcc -acc -ta=tesla laplace_acc.c
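As a rough illustration of what the -acc flag enables, the C sketch below wraps a fixed number of Jacobi sweeps in an OpenACC data region so that the arrays stay on the accelerator between sweeps. It is not the workshop's laplace_acc.c. The grid size M, the boundary values, and the names grid, next and relax_acc are assumptions. It uses a fixed iteration count rather than a convergence test to stay short.

```c
#include <assert.h>

#define M 32 /* hypothetical interior size, kept small for a quick run */

double grid[M + 2][M + 2];
double next[M + 2][M + 2];

/* Run `iters` Jacobi sweeps and return the value at the interior corner. */
double relax_acc(int iters)
{
    int i, j, it;
    assert(iters >= 0);

    /* Assumed boundary values: 0 on the top and left edges,
       rising linearly to 100 along the right and bottom edges. */
    for (i = 0; i <= M + 1; i++)
        for (j = 0; j <= M + 1; j++)
            grid[i][j] = 0.0;
    for (i = 0; i <= M + 1; i++) {
        grid[i][M + 1] = 100.0 * i / (M + 1);
        grid[M + 1][i] = 100.0 * i / (M + 1);
    }

    /* grid is copied to the device once and back once at the end;
       next exists only on the device for the lifetime of the region. */
    #pragma acc data copy(grid) create(next)
    {
        for (it = 0; it < iters; it++) {
            /* Average the four neighbours of every interior point. */
            #pragma acc kernels
            for (i = 1; i <= M; i++)
                for (j = 1; j <= M; j++)
                    next[i][j] = 0.25 * (grid[i + 1][j] + grid[i - 1][j] +
                                         grid[i][j + 1] + grid[i][j - 1]);

            /* Copy the new values back for the next sweep. */
            #pragma acc kernels
            for (i = 1; i <= M; i++)
                for (j = 1; j <= M; j++)
                    grid[i][j] = next[i][j];
        }
    }
    return grid[M][M];
}
```

Without the data region, each kernels block would move the arrays between host and device on every sweep. With a compiler that lacks OpenACC support, the pragmas are ignored and the loops run serially on the CPU.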