The auto_tmpdir Plugin for Per-job Temporary Directories

Many jobs that run on a cluster create temporary files of some kind:

  • Open MPI stores information regarding the topology and configuration of the worker processes in $TMPDIR
  • VALET stores environment snapshots (on each vpkg_require) in $TMPDIR
  • The PSM2 library (for accelerated MPI over Intel OPA) and Open MPI's VADER plugin create shared memory segments in /dev/shm
  • Some programs are hard-coded to use the /tmp directory directly (not $TMPDIR)

Historically on Caviness the auto_tmpdir plugin created a per-job directory on a local scratch disk and set $TMPDIR to that path. When the job completed, the per-job directory was automatically removed. This easily handled the first two items above (and any other software that was designed to reference $TMPDIR for temporary file storage).
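The create, export, and remove cycle described above can be illustrated with a minimal Python sketch. This is not the plugin's code: the function name per_job_tmpdir, the base parameter, and the job-<id> directory naming (borrowed from the examples later on this page) are assumptions made for illustration.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def per_job_tmpdir(job_id, base="/tmp"):
    """Create <base>/job-<job_id>, point TMPDIR at it, remove it on exit.

    Hypothetical helper: the name and directory layout are illustrative only.
    """
    path = os.path.join(base, f"job-{job_id}")
    os.makedirs(path, mode=0o700)
    previous = os.environ.get("TMPDIR")
    os.environ["TMPDIR"] = path
    try:
        yield path
    finally:
        # Restore the caller's environment, then delete whatever the job left.
        if previous is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = previous
        shutil.rmtree(path, ignore_errors=True)


if __name__ == "__main__":
    # Demonstrate against a throwaway base directory, not the real /tmp.
    with tempfile.TemporaryDirectory() as base:
        with per_job_tmpdir(8451, base=base) as tmpdir:
            print("TMPDIR inside job:", os.environ["TMPDIR"])
        print("directory still exists after job:", os.path.exists(tmpdir))
```

Only software that consults $TMPDIR benefits from this approach, which is why the remaining two items in the list need something stronger.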

The latter two items have remained an issue, though, since it is up to the software (or even the user's job script) to remove the temporary files when the job completes. Crashes and early termination of Open MPI are particularly troublesome because they leave behind many files in /dev/shm, which is stored not on disk but in physical RAM and swap, so orphaned files diminish the amount of RAM available on the node over time. It was necessary for us to deploy on Farber and Caviness a periodic cleanup program that identifies which files in /dev/shm are in use and deletes the rest.
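The cleanup program itself is not described here, so the following is only a sketch of one way "in use" could be determined on Linux: a file under /dev/shm is treated as in use if some process has it mapped (it appears in /proc/<pid>/maps) or holds it open (it is the target of a /proc/<pid>/fd link). All function names are hypothetical, and the sketch only reports candidates rather than deleting anything.

```python
import os

SHM_DIR = "/dev/shm"
DELETED_SUFFIX = " (deleted)"


def shm_paths_in_maps(maps_text, shm_dir=SHM_DIR):
    """Return the shm_dir paths named in the text of a /proc/<pid>/maps file."""
    paths = set()
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode pathname
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no pathname
        path = fields[5].strip()
        if path.endswith(DELETED_SUFFIX):
            path = path[: -len(DELETED_SUFFIX)]
        if path.startswith(shm_dir + "/"):
            paths.add(path)
    return paths


def in_use_shm_paths(proc_root="/proc", shm_dir=SHM_DIR):
    """Collect shm_dir paths that any readable process has mapped or open."""
    used = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "maps")) as fh:
                used |= shm_paths_in_maps(fh.read(), shm_dir)
            fd_dir = os.path.join(pid_dir, "fd")
            for fd in os.listdir(fd_dir):
                target = os.readlink(os.path.join(fd_dir, fd))
                if target.startswith(shm_dir + "/"):
                    used.add(target)
        except OSError:
            # The process exited or is not readable; skip the rest of it.
            continue
    return used


def orphaned(candidates, used):
    """Return the candidates that no process is using, sorted for reporting."""
    return sorted(set(candidates) - set(used))
```

A scan like this must run with enough privilege to read every process's /proc entries; otherwise files belonging to unreadable processes would be misclassified as orphaned.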

The Linux virtual file system (VFS) is a hierarchy of directories and files, referenced by path. That namespace is a single tree, unlike Windows where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.). In Linux, though, the directories and files on each storage component are mounted at some position within the single VFS tree. On Caviness, IT-maintained software can be found under /opt/shared on every node, yet those directories and files are not stored on disks present in that node; rather, they are stored on a file server whose storage is mounted at /opt/shared on the nodes. Interacting with the /opt directory requires Linux to access a local disk in the node, but moving to /opt/shared requires interaction with a file server across the network, all transparent to the user who is accessing the file system.
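To make the single-tree idea concrete, the sketch below finds which mounted storage component backs a given path by taking the longest mount point that is a prefix of the path. On a real node the mount list would come from /proc/self/mounts; the sample entries here (device name, server name, file system types) are invented for illustration and do not describe the actual Caviness configuration.

```python
import os

# Hypothetical mount table in /proc/self/mounts format:
#   source  mount_point  fstype  options  dump  pass
SAMPLE_MOUNTS = """\
/dev/sda1 / xfs rw 0 0
fileserver:/export/shared /opt/shared nfs rw 0 0
"""


def parse_mounts(text):
    """Parse mount-table text into (mount_point, source, fstype) tuples."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[0], fields[2]))
    return mounts


def backing_mount(path, mounts):
    """Return the entry whose mount point is the longest prefix of path."""
    path = os.path.normpath(path)
    best = None
    for entry in mounts:
        mount_point = entry[0]
        covers = (
            mount_point == "/"
            or path == mount_point
            or path.startswith(mount_point.rstrip("/") + "/")
        )
        # ">=" lets a later mount on the same mount point shadow an earlier one.
        if covers and (best is None or len(mount_point) >= len(best[0])):
            best = entry
    return best
```

With the sample table, a path directly under /opt resolves to the local root file system, while anything under /opt/shared resolves to the network mount.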

A bind mount in Linux takes an existing directory in the VFS tree and mounts it at another path. This seems like a solution to our problem above: if the per-job temporary directory (e.g. /tmp/job-8451) were bind-mounted at /tmp, then programs that use /tmp rather than $TMPDIR would not leave files in the node's real /tmp when they crash; the files would be in /tmp/job-8451 and would be removed with it. By the same token, bind-mounting /dev/shm/job-8451 at /dev/shm would contain the shared memory segments created by PSM2 and VADER in Open MPI jobs.

Unfortunately, once /tmp/job-8451 is bind-mounted as /tmp, every program on the node (including Slurm itself) will store its temporary files in /tmp/job-8451. If the node is shared by multiple jobs this would be a major problem, since each subsequent job that starts will modify what's mounted at /tmp:

Position  Job   What's actually mounted
1         8451  /tmp/job-8451
2         8456  /tmp/job-8451/job-8456
3         8460  /tmp/job-8451/job-8456/job-8460
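The nesting in the table can be reproduced with a small simulation. It assumes, as the table does, that every job creates its directory as /tmp/job-<id> and then bind-mounts it at /tmp in the one shared VFS tree, so each new directory is created inside whatever the previous job left mounted there. The function name is hypothetical.

```python
def stacked_tmp_mounts(job_ids, tmp="/tmp"):
    """Return (job_id, real directory now backing tmp) after each bind mount."""
    actual = tmp  # the real directory that the path "/tmp" currently resolves to
    history = []
    for job_id in job_ids:
        # "mkdir /tmp/job-N" lands inside whatever is mounted at /tmp right now,
        # and the following bind mount makes that new directory the visible /tmp.
        actual = f"{actual}/job-{job_id}"
        history.append((job_id, actual))
    return history


if __name__ == "__main__":
    for position, (job_id, path) in enumerate(
        stacked_tmp_mounts([8451, 8456, 8460]), start=1
    ):
        print(position, job_id, path)
```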

Once job 8456 has altered what's mounted at /tmp, job 8451 will no longer see the temporary files it created in /tmp/job-8451 and the program will likely crash. The same problem would exist for Slurm and OS programs that were using files in /tmp before job 8451 started.

For the bind-mount solution to work, each Slurm job needs to have its own VFS tree that is independent of other programs on the node. Linux mount namespaces are exactly that:

  • every program that executes starts with its parent's VFS tree
  • if the program has appropriate privileges, it can clone that initial VFS tree
  • storage components subsequently mounted/unmounted by the program only affect its own VFS tree

For Slurm jobs this equates to:

  1. When the job starts, the plugin creates a per-job temporary directory (/tmp/job-8451)
  2. The plugin clones the VFS tree (now has a private mount namespace)
  3. The plugin bind-mounts /tmp/job-8451 as /tmp (/tmp/job-8451 no longer visible to this program)
  4. When the job ends, the plugin unmounts /tmp (/tmp/job-8451 is again visible to this program)
  5. The plugin removes /tmp/job-8451
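The five steps above map onto the Linux unshare(2), mount(2), and umount(2) system calls. The sketch below calls them from Python through ctypes purely for illustration; it is not the plugin's code, the function names are hypothetical, and it must run as root. One step is added that the list does not mention: on many distributions mounts are shared by default, so the new namespace is first marked private to keep the bind mount from propagating back to the rest of the node.

```python
import ctypes
import os
import shutil

CLONE_NEWNS = 0x00020000  # <sched.h>: new mount namespace
MS_BIND = 4096            # <sys/mount.h>
MS_REC = 16384
MS_PRIVATE = 1 << 18


def _libc():
    """Load the C library symbols and declare the signatures we call."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]
    libc.umount.argtypes = [ctypes.c_char_p]
    return libc


def _check(ret, what):
    """Raise OSError with errno details if a libc call failed."""
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, f"{what}: {os.strerror(err)}")


def job_dir(job_id, base="/tmp"):
    """Hypothetical naming scheme, matching the examples on this page."""
    return os.path.join(base, f"job-{job_id}")


def enter_private_dir(job_id, base="/tmp"):
    """Steps 1-3: create the directory, clone the VFS tree, bind-mount."""
    libc = _libc()
    path = job_dir(job_id, base)
    os.makedirs(path, mode=0o700, exist_ok=True)                    # step 1
    _check(libc.unshare(CLONE_NEWNS), "unshare")                    # step 2
    # Assumption: stop mount events propagating to other namespaces.
    _check(libc.mount(None, b"/", None, MS_REC | MS_PRIVATE, None),
           "make / private")
    _check(libc.mount(os.fsencode(path), os.fsencode(base), None,
                      MS_BIND, None), "bind mount")                 # step 3
    return path


def leave_private_dir(path, base="/tmp"):
    """Steps 4-5: unmount, then remove the now-visible per-job directory."""
    _check(_libc().umount(os.fsencode(base)), "umount")             # step 4
    shutil.rmtree(path, ignore_errors=True)                         # step 5
```

Calling enter_private_dir(8451) and later leave_private_dir("/tmp/job-8451") from the same process would walk through steps 1 to 5; passing base="/dev/shm" would apply the same procedure to shared memory segments.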

With the same procedure applied to /dev/shm, the major sources of orphaned temporary files would be contained.

  • Last modified: 2020-03-12 11:03
  • by frey