The auto_tmpdir Plugin for Per-job Temporary Directories

Many jobs that run on a cluster create temporary files of some kind:

Open MPI stores information regarding the topology and configuration of the worker processes in $TMPDIR
VALET stores environment snapshots (on each vpkg_require) in $TMPDIR
The PSM2 library (for accelerated MPI over Intel OPA) and VADER plugin create shared memory segments in /dev/shm
Some programs are hard-coded to use the /tmp directory directly (not $TMPDIR)

Historically on Caviness the auto_tmpdir plugin created a per-job directory on a local scratch disk and set $TMPDIR to that path. When the job completed, the per-job directory was automatically removed. This easily handled the first two items above (and any other software that was designed to reference $TMPDIR for temporary file storage).

The latter two items have remained an issue, though, since it is up to the software (or even the user's job script) to remove the temporary files when the job completes. Crashes and early termination of Open MPI are particularly troublesome because they leave behind many files in /dev/shm, which is backed not by disk but by physical RAM and swap, so orphaned files reduce the RAM available on the node over time. It was necessary for us to deploy on Farber and Caviness a periodic cleanup program that identifies which files in /dev/shm are in use and deletes the rest.
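The idea behind such a cleanup pass can be sketched in Python. This is a minimal illustration, not the program deployed on Farber and Caviness: it assumes a file counts as "in use" when some process has it open (seen through /proc/«pid»/fd) or memory-mapped (seen through /proc/«pid»/maps), and the function names in_use_paths and orphaned_files are hypothetical. It only reports candidates and deletes nothing.

```python
import os


def in_use_paths(proc_root="/proc", prefix="/dev/shm/"):
    """Collect paths under `prefix` that some process has open or mapped."""
    used = set()
    if not os.path.isdir(proc_root):
        return used  # no procfs available (e.g. not a Linux host)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        pid_dir = os.path.join(proc_root, entry)
        # Open file descriptors are symlinks to the underlying paths.
        try:
            for fd in os.listdir(os.path.join(pid_dir, "fd")):
                try:
                    target = os.readlink(os.path.join(pid_dir, "fd", fd))
                except OSError:
                    continue  # descriptor closed while we were looking
                if target.startswith(prefix):
                    used.add(target)
        except OSError:
            pass  # process exited, or we lack permission to inspect it
        # Memory-mapped files appear in the last column of the maps file.
        try:
            with open(os.path.join(pid_dir, "maps")) as maps:
                for line in maps:
                    fields = line.rstrip("\n").split(None, 5)
                    if len(fields) == 6 and fields[5].startswith(prefix):
                        used.add(fields[5])
        except OSError:
            pass
    return used


def orphaned_files(candidates, used):
    """Return the candidate paths that no process is using, sorted."""
    return sorted(path for path in candidates if path not in used)
```

A real cleanup program would also have to handle the race between the scan and the deletion (a file can be created or opened in between), which this sketch ignores.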

Bind mounts

The Linux virtual file system (VFS) is a hierarchy of directories and files, referenced by path. That namespace is a single tree, unlike Windows where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.). In Linux, the directories and files on each storage component are instead mounted at some position within the single VFS tree. On Caviness, IT-maintained software can be found under /opt/shared on every node, yet those directories and files are not stored on disks present in that node; rather, they are stored on a file server whose storage is mounted at /opt/shared on the nodes. Interacting with the /opt directory requires Linux to access a local disk in the node, but moving to /opt/shared requires interaction with a file server across the network, all of which is transparent to the user who is accessing the file system.

A bind mount in Linux takes an existing directory in the VFS tree and mounts it at another path. This seems like a solution to our problem above: if the per-job temporary directory (e.g. /tmp/job-8451) were bind-mounted at /tmp, then programs that use /tmp rather than $TMPDIR would not leave files in /tmp when they crash; the files would be in /tmp/job-8451. By the same token, bind-mounting /dev/shm/job-8451 at /dev/shm would contain the shared memory segments created by PSM2 and VADER in Open MPI jobs.

Unfortunately, once /tmp/job-8451 is bind-mounted as /tmp, every program on the node (including Slurm itself) will store its temporary files in /tmp/job-8451. If the node is shared by multiple jobs this would be a major problem: each subsequent job creates its own directory inside whatever is currently mounted at /tmp and then changes what's mounted there:

Position	Job	What's actually mounted
1	8451	`/tmp/job-8451`
2	8456	`/tmp/job-8451/job-8456`
3	8460	`/tmp/job-8451/job-8456/job-8460`

Once job 8456 has altered what's mounted at /tmp, job 8451 will no longer see the temporary files it created in /tmp/job-8451 and the program will likely crash. The same problem would exist for Slurm and OS programs that were using files in /tmp before job 8451 started.
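The nesting shown in the table can be reproduced with a small model. This is only an illustration of the path arithmetic, assuming each job creates /tmp/job-«job-id» through its current view of /tmp and then bind-mounts it over /tmp; the function name simulate_shared_namespace is hypothetical and no real mounts are performed.

```python
def simulate_shared_namespace(job_ids, mountpoint="/tmp"):
    """Model successive bind mounts over `mountpoint` in one shared namespace.

    Returns a list of (job_id, backing_directory) pairs, where
    backing_directory is the real location visible at `mountpoint`
    after that job has started.
    """
    backing = mountpoint  # real location currently visible at the mountpoint
    history = []
    for job_id in job_ids:
        # The new directory is created through the current view of the
        # mountpoint, so it lands inside the previous job's directory.
        backing = f"{backing}/job-{job_id}"
        history.append((job_id, backing))
    return history


for position, (job_id, backing) in enumerate(
        simulate_shared_namespace([8451, 8456, 8460]), start=1):
    print(position, job_id, backing)
```

Each job's directory ends up nested inside its predecessor's, which is why earlier jobs lose sight of their own files.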

Namespaces to the rescue

For the bind-mount solution to work, each Slurm job needs to have its own VFS tree that is independent of other programs on the node. Linux mount namespaces are exactly that:

every program that executes starts with its parent's VFS tree
if the program has appropriate privileges, it can clone that initial VFS tree
storage components subsequently mounted/unmounted by the program only affect its own VFS tree

For Slurm jobs this equates to:

When the job starts, the plugin creates a per-job temporary directory (/tmp/job-8451)
The plugin clones the VFS tree (now has a private mount namespace)
The plugin bind-mounts /tmp/job-8451 as /tmp (/tmp/job-8451 no longer visible to this program)
When the job ends, the plugin unmounts /tmp (/tmp/job-8451 is again visible to this program)
The plugin removes /tmp/job-8451
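The steps above map onto two Linux system calls, unshare(2) and mount(2). The following Python sketch calls them through ctypes to show the sequence; it is not the plugin's source code. The function names are hypothetical, the flag values are copied from the Linux headers, and the calls need root privileges (CAP_SYS_ADMIN), so the functions are defined here but not invoked.

```python
import ctypes
import os
import shutil

# Constants from the Linux headers <sched.h> and <sys/mount.h>.
CLONE_NEWNS = 0x00020000  # unshare(): create a new mount namespace
MS_BIND = 4096            # mount(): bind an existing directory elsewhere
MS_REC = 16384            # mount(): apply to the whole subtree
MS_PRIVATE = 1 << 18      # mount(): do not propagate mount events


def _libc():
    """Load the C library and declare the calls used below."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.unshare.argtypes = [ctypes.c_int]
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_char_p, ctypes.c_ulong, ctypes.c_void_p]
    libc.umount.argtypes = [ctypes.c_char_p]
    return libc


def _check(ret):
    """Raise OSError carrying errno if a libc call reported failure."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def enter_job_namespace(job_dir, mountpoint="/tmp"):
    """Create job_dir, clone the VFS tree, and bind job_dir over mountpoint."""
    libc = _libc()
    os.makedirs(job_dir, mode=0o700, exist_ok=True)
    # Clone the VFS tree: later mounts affect only this process's view.
    _check(libc.unshare(CLONE_NEWNS))
    # Keep mount events from propagating back to the parent namespace.
    _check(libc.mount(b"none", b"/", None, MS_REC | MS_PRIVATE, None))
    # Bind-mount the per-job directory over the mountpoint.
    _check(libc.mount(os.fsencode(job_dir), os.fsencode(mountpoint),
                      None, MS_BIND, None))


def leave_job_namespace(job_dir, mountpoint="/tmp"):
    """Unmount the bind mount, then remove the per-job directory."""
    libc = _libc()
    _check(libc.umount(os.fsencode(mountpoint)))
    shutil.rmtree(job_dir)  # job_dir is visible again after the unmount
```

A caller running as root would use enter_job_namespace("/tmp/job-8451") before launching the job's processes and leave_job_namespace("/tmp/job-8451") afterwards. Marking the tree private before the bind mount is an assumption made here so that the example does not leak mounts to the rest of the node on systems where mounts are shared by default.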

With the same procedure applied to /dev/shm, the major sources of orphaned temporary files would be contained.

The auto_tmpdir plugin

The original auto_tmpdir Slurm plugin has been rewritten (as of March 12, 2020) to no longer set $TMPDIR to the per-job directory it creates. Instead, it creates the following paths and bind-mounts them:

Directory created	Bind mountpoint
`/tmp/job-«job-id»`
`/tmp/job-«job-id»/tmp`	`/tmp`
`/tmp/job-«job-id»/var_tmp`	`/var/tmp`
`/dev/shm/job-«job-id»`	`/dev/shm`

Shared tmpdir

In some cases the user may want the /tmp directory for the job to be shared by all nodes participating in the job, e.g. somewhere on /lustre/scratch. The auto_tmpdir plugin implements a --use-shared-tmpdir flag to the salloc/srun/sbatch commands to request this:

Directory created	Bind mountpoint
`/lustre/scratch/slurm/job-«job-id»`
`/lustre/scratch/slurm/job-«job-id»/tmp`	`/tmp`
`/lustre/scratch/slurm/job-«job-id»/var_tmp`	`/var/tmp`
`/dev/shm/job-«job-id»`	`/dev/shm`

A variant on the shared temporary directory scheme is to have each node use its own separate subdirectory (--use-shared-tmpdir=per-node):

Directory created	Bind mountpoint
`/lustre/scratch/slurm/job-«job-id»`
`/lustre/scratch/slurm/job-«job-id»/«hostname»`
`/lustre/scratch/slurm/job-«job-id»/«hostname»/tmp`	`/tmp`
`/lustre/scratch/slurm/job-«job-id»/«hostname»/var_tmp`	`/var/tmp`
`/dev/shm/job-«job-id»`	`/dev/shm`
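The three layouts tabulated above follow one pattern, which the Python sketch below reproduces. It is an illustration of the path scheme only, not the plugin's logic: the function name plan_mounts and its parameters are hypothetical, and the example hostname r00n00 is made up.

```python
def plan_mounts(job_id, shared_root=None, per_node=False, hostname=None):
    """Return (directories_created, bind_mounts) for one job on one node.

    shared_root: None for node-local /tmp, or a shared path such as
                 "/lustre/scratch/slurm".
    per_node:    with a shared root, add a per-host subdirectory.
    """
    job = f"job-{job_id}"
    base = f"{shared_root}/{job}" if shared_root else f"/tmp/{job}"
    created = [base]
    if shared_root and per_node:
        if not hostname:
            raise ValueError("per-node layout needs a hostname")
        base = f"{base}/{hostname}"
        created.append(base)
    shm = f"/dev/shm/{job}"  # shared memory always stays node-local
    created += [f"{base}/tmp", f"{base}/var_tmp", shm]
    mounts = [
        (f"{base}/tmp", "/tmp"),
        (f"{base}/var_tmp", "/var/tmp"),
        (shm, "/dev/shm"),
    ]
    return created, mounts


# Per-node shared layout for a hypothetical host named r00n00.
created, mounts = plan_mounts(8451, "/lustre/scratch/slurm",
                              per_node=True, hostname="r00n00")
for source, target in mounts:
    print(f"{source} -> {target}")
```

In all three layouts only the location of the tmp and var_tmp directories changes; /dev/shm is always backed by a directory on the node itself.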

When using the --use-shared-tmpdir flag, the plugin can also be asked not to remove the directories when the job exits by including the --no-rm-tmpdir flag.

The --no-rm-tmpdir flag should be used very cautiously, since leaving files behind on /lustre/scratch will consume capacity on that file system. A viable usage scenario would be debugging a job script that copies files to local scratch, runs a job, then copies results back to other storage. Once that behavior is debugged and the job goes into production, the user would stop using the --no-rm-tmpdir and --use-shared-tmpdir flags.

Source code

The source code for the auto_tmpdir plugin is publicly available on GitHub.