DARWIN Slurm Job Script Templates
As on Caviness, environment sensing and setup code has been shifted out of the job script templates and into external script fragments that are sourced (executed) by the job script. What remains in the job script templates is the setting of variables that influence those external fragments' execution, followed by the sourcing of the fragments themselves. When IT-RCI must change the behavior of job scripts, only the external fragments are modified and the change takes effect for all job scripts deriving from the templates.
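In practice the pattern is only a few lines long. A minimal sketch, using the UD_JOB_EXIT_FN variable and common.sh fragment discussed later on this page (the handler body here is hypothetical):

    # Set variables that influence the fragment before sourcing it:
    job_exit_handler() {
        echo "cleaning up before the job exits"
    }
    UD_JOB_EXIT_FN=job_exit_handler

    # Source the external fragment maintained by IT-RCI:
    . /opt/shared/slurm/templates/libexec/common.sh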
Where Can I Find Them?
IT-RCI staff maintain the DARWIN templates via git. The production copy of the repository is checked out in /opt/shared/slurm/templates, with a symbolic link present at /opt/shared/templates/slurm to maintain parity with other HPC systems.
The external script fragments (mentioned above and discussed in detail below) can be found in the /opt/shared/slurm/templates/libexec directory.
The /opt/shared/templates/slurm symbolic link points to the /opt/shared/slurm/templates/job-scripts directory, which is organized into distinct job classes. The top level is split into applications and generic.
Applications
Software packages that have unique runtime requirements have a single script or a directory of scripts located in the applications directory. TensorFlow is a good example: DARWIN uses Linux containers (created by Google and distributed via Docker) to execute TensorFlow scripts on compute nodes. Gaussian requires extra work to tailor input files and its expected environment variables to match the resources that Slurm allocated to the job.
Generic
The application-specific job scripts are actually based on the generic scripts present in the generic directory. Serial jobs can make use of the serial.qs script; programs leveraging threaded (e.g. OpenMP) parallelism can use threads.qs as a starting point. A typical copy-edit-submit workflow is sketched below.
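For example, a plausible workflow for a threaded job (the destination directory is arbitrary, and the edits depend on your program):

    $ cp /opt/shared/templates/slurm/generic/threads.qs ~/my_job/
    $ cd ~/my_job
    $ nano threads.qs    # adjust resource requests and the command to run
    $ sbatch threads.qs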
The mpi directory divides that programming paradigm into implementation-specific variants: mpich, openmpi, and generic (a catch-all that uses machine files and should generally NOT be used).
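Putting the pieces together, the hierarchy can be summarized with a quick listing (output abbreviated to the items discussed above; the directories may contain additional files):

    $ ls /opt/shared/templates/slurm
    applications  generic
    $ ls /opt/shared/templates/slurm/generic
    mpi  serial.qs  threads.qs
    $ ls /opt/shared/templates/slurm/generic/mpi
    generic  mpich  openmpi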
Hierarchical Modularity
Environment setup tasks have been abstracted into the external script fragment files. Examining the fragment directory:

    -rw-r--r-- 1 frey sysadmin 1733 Sep 12  2018 common.sh
    -rw-r--r-- 1 frey sysadmin 5621 May  6 14:11 gaussian.sh
    -rw-r--r-- 1 frey sysadmin 1580 Sep 12  2018 generic-mpi.sh
    -rw-r--r-- 1 frey sysadmin 1432 Sep 14  2018 mpich.sh
    -rw-r--r-- 1 frey sysadmin 4805 Sep 12  2018 openmpi.sh
    -rw-r--r-- 1 frey sysadmin 2209 May  6 14:11 openmp.sh
Signal Handling
One thing added to the job environment by the common.sh fragment, for jobs that register a preemption/timeout signal handler, is the UD_EXEC function. By default, if a Bash shell is executing a foreground command it will not handle any trapped signals until that command has finished executing. Consider a job script like this:
    cleanup() {
        echo "Time limit exceeded, scrubbing all junk files now"
        exit 0
    }
    UD_JOB_EXIT_FN=cleanup
    . /opt/shared/slurm/templates/libexec/common.sh

    sleep 500000000
When this job is preempted, the SIGTERM signal is delivered to the Bash shell. The shell notes this, but waits for the sleep command to finish executing. Since that sleep will last 15.85 years, the 5-minute grace period given to jobs that are preempted or time out expires before the cleanup function is ever called, and the job is killed.
In order for signals to be handled asynchronously in the Bash shell, long-running commands must be run in the background. The UD_EXEC function does just that:
    cleanup() {
        echo "Time limit exceeded, scrubbing all junk files now"
        exit 0
    }
    UD_JOB_EXIT_FN=cleanup
    . /opt/shared/slurm/templates/libexec/common.sh

    UD_EXEC sleep 500000000
The sleep command is executed in the background, and when the SIGTERM preemption/timeout signal is delivered, the shell immediately calls the cleanup function as expected.
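For reference, the technique UD_EXEC relies on can be sketched in a few lines of Bash; the actual implementation in common.sh may differ in its details:

    UD_EXEC() {
        # Launch the command in the background so the shell is not
        # blocked in a foreground command:
        "$@" &
        local pid=$!
        # Bash's built-in wait returns as soon as a trapped signal is
        # handled, so SIGTERM triggers the registered handler right away:
        wait "$pid"
        return $?
    }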