On both Mills and Farber, example job scripts were made available under the /opt/shared/templates/gridengine directory. Each template contained extensive Bash code to sense and set up the job environment, so the templates were self-contained and had no external dependencies.
One major problem with this paradigm arises when the templates must change. Because users work from copies of a template, they are responsible for merging any later changes into those copies. In practice, this simply does not happen: a working job script is reused over and over again without modification.
On Caviness, the job script templates are structured differently. Most of the code that senses and sets up the environment has been moved out of the job script templates and into external script fragments that are sourced (executed) by the job script. What remains in the templates is the setting of variables that influence the execution of those external fragments, followed by the sourcing of the fragments themselves. Now, when IT must change the behavior of job scripts, the external fragments are modified and the change takes effect for all job scripts derived from the templates.
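The split between job script and fragment described above can be sketched as follows. This is only an illustration of the pattern, not the content of any real Caviness fragment: the file demo-fragment.sh and the variable DEMO_THREADS are made-up names, and a temporary directory stands in for the shared fragment location so the sketch can be run anywhere.

```shell
# Throwaway directory standing in for the shared fragment location.
fragment_dir=$(mktemp -d)

# Hypothetical fragment: the environment setup lives here and would be
# maintained centrally, outside of any user's copy of a job script.
cat > "$fragment_dir/demo-fragment.sh" <<'EOF'
# DEMO_THREADS is a made-up variable that the job script sets before sourcing.
export OMP_NUM_THREADS="${DEMO_THREADS:-1}"
EOF

# What remains in the job script: set the variable, then source the fragment.
DEMO_THREADS=4
. "$fragment_dir/demo-fragment.sh"

echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
rm -rf "$fragment_dir"
```

Because the job script only sets a variable and sources a file, a later change to the fragment would reach every job script that sources it, with no edits to the user's copy.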
IT staff maintain the Caviness templates via git. The production copy of the repository is checked out in /opt/shared/slurm/templates, with a symbolic link present at /opt/shared/templates/slurm to maintain parity with the Mills and Farber systems.
The external script fragments (mentioned above and discussed in detail below) can be found in the /opt/shared/slurm/templates/libexec directory.
The /opt/shared/templates/slurm symbolic link points to the /opt/shared/slurm/templates/job-scripts directory, which is organized into distinct job classes. The top level is split into applications and generic.
Software packages that have unique runtime requirements will have a single script or a directory of scripts located in the applications directory. TensorFlow is a good example: Caviness uses Linux containers (created by Google and distributed via Docker) to execute TensorFlow scripts on compute nodes. Gaussian requires extra work to tailor input files and its own expected environment variables to match the resources allocated to the job by Slurm.
The application-specific job scripts are actually based on the generic scripts present in the generic directory. Serial jobs can make use of the serial.qs script; programs leveraging threaded (e.g. OpenMP) parallelism can use threads.qs as a starting point.
The mpi directory divides MPI jobs into implementation-specific variants: mpich, openmpi, and generic (a catch-all that uses machine files and should generally NOT be used).
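Using one of the generic templates as a starting point might look like the sketch below. The helper copy_template, the TEMPLATE_ROOT override, and the file name my_job.qs are hypothetical names introduced for illustration. The location generic/serial.qs under the symbolic link is inferred from the layout described above, so check it on the cluster before relying on it.

```shell
# Copy a generic template to a new file that can then be edited.
# TEMPLATE_ROOT is a hypothetical override, present only so the sketch can be
# tried away from the cluster; on Caviness the default path would apply.
copy_template() {
    template="${TEMPLATE_ROOT:-/opt/shared/templates/slurm}/generic/$1"
    if [ -r "$template" ]; then
        cp "$template" "$2"
    else
        echo "Template not found: $template" >&2
        return 1
    fi
}

if copy_template serial.qs my_job.qs; then
    echo "Edit my_job.qs, then submit it with: sbatch my_job.qs"
fi
```

The copy keeps only the variable settings and the lines that source the external fragments, so it should need far less upkeep than a copy of a Mills or Farber template.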
Environment setup tasks have been abstracted into the external script fragment files. Examining the fragment directory:
    -rw-r--r-- 1 frey sysadmin 1733 Sep 12  2018 common.sh
    -rw-r--r-- 1 frey sysadmin 5621 May  6 14:11 gaussian.sh
    -rw-r--r-- 1 frey sysadmin 1580 Sep 12  2018 generic-mpi.sh
    -rw-r--r-- 1 frey sysadmin 1432 Sep 14  2018 mpich.sh
    -rw-r--r-- 1 frey sysadmin 4805 Sep 12  2018 openmpi.sh
    -rw-r--r-- 1 frey sysadmin 2209 May  6 14:11 openmp.sh
For jobs that register a preemption/timeout signal handler, the common.sh fragment adds the UD_EXEC function to the job environment. By default, a Bash shell that is executing a foreground command will not run any signal handler until that command has finished. Consider a job script like this:
      :
    cleanup() {
        echo "Time limit exceeded, scrubbing all junk files now"
        exit 0
    }
    UD_JOB_EXIT_FN=cleanup

    . /opt/shared/slurm/templates/libexec/common.sh

    sleep 500000000
When this job is preempted, the SIGTERM signal is delivered to the Bash shell. The shell notes this, but waits for the sleep command to finish executing. Since that sleep will last 15.85 years, the 5-minute grace period given to jobs that are preempted or time out expires before the cleanup function ever gets called, and the job is killed.
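The deferred handling can be reproduced off-cluster with a scaled-down sketch. The assumptions are these: a plain trap on TERM stands in for whatever common.sh registers on behalf of UD_JOB_EXIT_FN, a background subshell stands in for Slurm delivering the signal, and the sleep is shortened to 3 seconds so the effect is visible quickly.

```shell
demo_script=$(mktemp)

# Write a throwaway script so that exit inside the handler ends only the demo.
cat > "$demo_script" <<'EOF'
#!/bin/bash
cleanup() {
    echo "cleanup ran after ${SECONDS}s"
    exit 0
}
trap cleanup TERM    # stand-in for the handler that common.sh would register

# Stand-in for Slurm: send SIGTERM to this shell after 1 second.
( sleep 1; kill -TERM $$ ) &

sleep 3              # foreground command: the trap is held until it finishes
echo "sleep finished"
EOF

output=$(bash "$demo_script")
echo "$output"
rm -f "$demo_script"
```

Although the signal arrives after about one second, the reported time is at least three seconds, because the handler runs only once the foreground sleep has returned.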
In order to get signals to work asynchronously in the Bash shell, long-running commands must be run in the background. The UD_EXEC function does just that:
      :
    cleanup() {
        echo "Time limit exceeded, scrubbing all junk files now"
        exit 0
    }
    UD_JOB_EXIT_FN=cleanup

    . /opt/shared/slurm/templates/libexec/common.sh

    UD_EXEC sleep 500000000
The sleep command is executed in the background and when the SIGTERM preemption/timeout signal is delivered, the shell immediately calls the cleanup function as expected.
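One way such a wrapper could work is sketched below: start the command in the background, then block in the wait builtin, which Bash interrupts as soon as a trapped signal arrives. This is an assumption about the general technique, not the actual code in common.sh. The function run_in_background is a hypothetical stand-in for UD_EXEC, a plain trap on TERM replaces the UD_JOB_EXIT_FN mechanism, and a background subshell plays the role of Slurm sending the signal.

```shell
demo_script=$(mktemp)

# Write a throwaway script so that exit inside the handler ends only the demo.
cat > "$demo_script" <<'EOF'
#!/bin/bash
# Hypothetical stand-in for UD_EXEC; the real function may differ.
run_in_background() {
    "$@" &               # run the command as a background child
    child_pid=$!
    wait "$child_pid"    # wait returns early when a trapped signal arrives
}

cleanup() {
    echo "cleanup ran after ${SECONDS}s"
    kill "$child_pid" 2>/dev/null    # do not leave the child running
    exit 0
}
trap cleanup TERM

# Stand-in for Slurm: send SIGTERM to this shell after 1 second.
( sleep 1; kill -TERM $$ ) &

run_in_background sleep 30
echo "sleep finished"
EOF

output=$(bash "$demo_script")
echo "$output"
rm -f "$demo_script"
```

Here the handler runs roughly one second in, long before the 30-second sleep would have finished. The sketch also terminates the background child in the handler; whether UD_EXEC does the same is not stated in this document.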