===== Laplace examples from XSEDE HPC Workshops =====

The XSEDE HPC workshop focused on taking a serial version of the Laplace example and developing three parallel solutions using OpenMP, MPI, and OpenACC. This documentation describes how to compile and run the serial version and the three parallel solutions of the Laplace example on Farber using different compilers.

Copy the examples to your home directory on Farber with the following command:

  cp -r ~trainf/Exercises .

If you are benchmarking, consider using ''-l exclusive=1'' with ''qlogin'' to prevent other jobs from running on your node during your interactive session.

==== Serial ====

=== GCC Fortran and C ===

Commands to type after logging in and starting a workgroup shell for GCC Fortran:

  qlogin
  cd Exercises/Serial/
  vpkg_devrequire gcc/4.9
  gfortran laplace_serial.f90
  time ./a.out
  exit

The ''Total time'' reported by the program is the user CPU time. The ''time'' command also reports the wall clock time as //real//.

  [(it_css:trainf)@n038 Serial]$ time ./a.out
  Maximum iterations [100-4000]?
  4000
  ---------- Iteration number: 100 ---------------
  ( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
  -------- more iteration progress reports --------
  ---------- Iteration number: 3300 ---------------
  ( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
  Max error at iteration 3372 was 9.99533103575345194E-003
  Total time was 34.876698 seconds.
  real 0m39.333s
  user 0m34.876s
  sys 0m0.002s

Commands to type after logging in and starting a workgroup shell for GCC C:

  qlogin
  cd Exercises/Serial/
  vpkg_devrequire gcc/4.9
  gcc laplace_serial.c -lm
  time ./a.out
  exit

The ''Total time'' reported by the program is the user CPU time. The ''time'' command also reports the wall clock time as //real//.

  [(it_css:trainf)@n038 Serial]$ time ./a.out
  Maximum iterations [100-4000]?
  4000
  ---------- Iteration number: 100 ---------------
  [995,995]: 63.33 [996,996]: 72.67 [997,997]: 81.40 [998,998]: 88.97 [999,999]: 94.86 [1000,1000]: 98.67
  -------- more iteration progress reports --------
  ---------- Iteration number: 3300 ---------------
  [995,995]: 97.66 [996,996]: 98.24 [997,997]: 98.75 [998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
  Max error at iteration 3372 was 0.009995
  Total time was 32.101587 seconds.
  real 0m34.803s
  user 0m32.123s
  sys 0m0.004s

By default, GCC performs no optimization. Adding the options ''-O3 -ffast-math'' gives run times comparable to those of the Intel compilers.

  qlogin
  cd Exercises/Serial/
  vpkg_devrequire gcc/4.9
  gfortran -O3 -ffast-math laplace_serial.f90
  time ./a.out
  exit

The ''Total time'' reported by the program is the user CPU time. The ''time'' command also reports the wall clock time as //real//.

  [(it_css:trainf)@n038 Serial]$ time ./a.out
  Maximum iterations [100-4000]?
  4000
  ---------- Iteration number: 100 ---------------
  ( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
  -------- more iteration progress reports --------
  ---------- Iteration number: 3300 ---------------
  ( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
  Max error at iteration 3372 was 9.9953310357463465E-003
  Total time was 4.87525797 seconds.
  real 0m7.105s
  user 0m4.871s
  sys 0m0.005s

=== Intel Fortran and C ===

Commands to type after logging in and starting a workgroup shell for Intel Fortran:

  qlogin
  cd Exercises/Serial/
  vpkg_devrequire intel/2016
  ifort laplace_serial.f90
  time ./a.out
  exit

The ''Total time'' reported by the program is the user CPU time. The ''time'' command also reports the wall clock time as //real//.

  [(it_css:trainf)@n038 Serial]$ time ./a.out
  Maximum iterations [100-4000]?
  4000
  ---------- Iteration number: 100 ---------------
  ( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
  -------- more iteration progress reports --------
  ---------- Iteration number: 3300 ---------------
  ( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
  Max error at iteration 3372 was 9.995331035753452E-003
  Total time was 6.477016 seconds.
  real 0m8.816s
  user 0m6.473s
  sys 0m0.007s

Commands to type after logging in and starting a workgroup shell for Intel C:

  qlogin
  cd Exercises/Serial/
  vpkg_devrequire intel/2016
  icc laplace_serial.c
  time ./a.out
  exit

The ''Total time'' reported by the program is the user CPU time. The ''time'' command also reports the wall clock time as //real//.

  [(it_css:trainf)@n038 Serial]$ time ./a.out
  Maximum iterations [100-4000]?
  4000
  ---------- Iteration number: 100 ---------------
  [995,995]: 63.33 [996,996]: 72.67 [997,997]: 81.40 [998,998]: 88.97 [999,999]: 94.86 [1000,1000]: 98.67
  -------- more iteration progress reports --------
  ---------- Iteration number: 3300 ---------------
  [995,995]: 97.66 [996,996]: 98.24 [997,997]: 98.75 [998,998]: 99.19 [999,999]: 99.56 [1000,1000]: 99.87
  Max error at iteration 3372 was 0.009995
  Total time was 17.156667 seconds.
  real 0m19.921s
  user 0m17.162s
  sys 0m0.009s

The Intel compilers optimize by default. Consider using the ''-O0'' option (capital letter 'O' followed by the number zero '0') to compile with no optimization, especially when you are debugging and testing your code for correctness.

==== OpenMP ====

Commands to type after logging in and starting a workgroup shell:

  workgroup -g it_css
  cd Exercises/OpenMP/Solutions/
  qlogin -pe threads 4
  export OMP_NUM_THREADS=4
  vpkg_devrequire gcc/4.9
  gfortran -fopenmp laplace_omp.f90
  time ./a.out
  exit

The ''Total time'' reported by the program is the user CPU time.
The ''time'' command also reports the wall clock time as //real//. Because the user CPU time is summed over all threads, it can exceed the wall clock time in a multithreaded run.

  [(it_css:trainf)@n036 Solutions]$ time ./a.out
  Maximum iterations [100-4000]?
  4000
  ---------- Iteration number: 100 ---------------
  ( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
  -------- more iteration progress reports --------
  ---------- Iteration number: 3300 ---------------
  ( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
  Max error at iteration 3372 was 9.9953310357534519E-003
  Total time was 46.3349571 seconds.
  real 0m16.459s
  user 0m46.331s
  sys 0m0.009s

The C version is built the same way, with the compile statement:

  gcc -fopenmp laplace_omp.c -lm

Intel compilers:

  workgroup -g it_css
  cd Exercises/OpenMP/Solutions/
  qlogin -pe threads 4
  export OMP_NUM_THREADS=4
  vpkg_devrequire intel/2016
  ifort -qopenmp laplace_omp.f90
  time ./a.out
  exit

  * The option ''-openmp'' was deprecated in intel/2016 and replaced by ''-qopenmp''.

The Intel C version is built the same way, with the compile statement:

  icc -qopenmp laplace_omp.c

==== MPI ====

Commands to type after logging in and starting a workgroup shell:

  qlogin -pe mpi 4
  cd Exercises/MPI/Solutions/
  vpkg_require openmpi
  mpifort laplace_mpi.f90
  mpirun -np 4 ./a.out
  exit

The ''Total time'' reported is taken from a high-resolution wall clock timer that is part of the MPI specification.

  [(it_css:trainf)@n036 Solutions]$ mpirun -np 4 ./a.out
  Maximum iterations [100-4000]?
  4000
  ---------- Iteration number: 100 ---------------
  ( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
  -------- more iteration progress reports --------
  ---------- Iteration number: 3300 ---------------
  ( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
  Max error at iteration 3372 was 9.99533095416893502E-003
  Total time was 9.3715754 seconds.
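Because the MPI run above is not wrapped in ''time'', the program's own ''Total time was ... seconds.'' line is the only timing it produces. The following is a minimal sketch of how that number could be pulled out of the output for benchmarking notes. The function name ''total_time'' is hypothetical and not part of the workshop materials, and the sketch assumes the output format shown above.

```shell
# Hypothetical helper: print only the number from the
# "Total time was N seconds." line of the program output.
# Reads the program output on standard input.
total_time() {
    # Fields: Total(1) time(2) was(3) N(4) seconds.(5)
    awk '/Total time was/ { print $4 }'
}
```

Assuming the program reads the iteration count from standard input and ''mpirun'' forwards standard input to the first rank, a command along the lines of ''echo 4000 | mpirun -np 4 ./a.out | total_time'' would print just the number of seconds.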
The C version of the MPI solution is built and run the same way, with the compile statement:

  mpicc laplace_mpi.c -lm

==== OpenACC ====

This must be run on a node with a GPU accelerator card. Commands to type after logging in and starting a workgroup shell:

  qlogin -l gpu
  cd Exercises/OpenACC/Solutions/
  vpkg_devrequire pgi/16
  pgf90 -acc -ta=tesla laplace_acc.f90
  time ./a.out
  exit

The ''Total time'' reported by the program is the time on the GPU. The ''time'' command reports the wall clock time as //real//.

  [(it_css:trainf)@n036 Solutions]$ time ./a.out
  Maximum iterations [100-4000]?
  4000
  ---------- Iteration number: 100 ---------------
  ( 995, 995): 63.33 ( 996, 996): 72.67 ( 997, 997): 81.40 ( 998, 998): 88.97 ( 999, 999): 94.86 (1000,1000): 98.67
  -------- more iteration progress reports --------
  ---------- Iteration number: 3300 ---------------
  ( 995, 995): 97.66 ( 996, 996): 98.24 ( 997, 997): 98.75 ( 998, 998): 99.19 ( 999, 999): 99.56 (1000,1000): 99.87
  Max error at iteration 3372 was 9.9953310357534519E-003
  Total time was 1.192778 seconds.
  real 0m5.095s
  user 0m0.891s
  sys 0m0.371s

The C version is built the same way, with the compile statement:

  pgcc -acc -ta=tesla laplace_acc.c
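When comparing runs, the //real// times on this page can be turned into a speedup and a parallel efficiency. The sketch below is one way to do that arithmetic. The function name ''speedup'' is hypothetical, and the example values are the //real// times of the unoptimized serial ''gfortran'' run (39.333 s) and the 4-thread OpenMP ''gfortran'' run (16.459 s). Both times include the time spent typing at the prompt, so treat the result as a rough indication only.

```shell
# Hypothetical helper: speedup SERIAL_SECONDS PARALLEL_SECONDS [WORKERS]
# Prints speedup = serial / parallel and
# efficiency = speedup / workers (workers defaults to 1).
speedup() {
    awk -v s="$1" -v p="$2" -v n="${3:-1}" 'BEGIN {
        printf "speedup %.2f efficiency %.2f\n", s / p, s / p / n
    }'
}
```

For the two runs named above, ''speedup 39.333 16.459 4'' prints ''speedup 2.39 efficiency 0.60''.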