Many jobs that run on a cluster create temporary files of some kind:
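- files that software writes to $TMPDIR
- files that software (added to the environment with vpkg_require) writes in $TMPDIR
- files in /dev/shm
- files written to the /tmp directory directly (not $TMPDIR)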
Historically on Caviness the auto_tmpdir plugin created a per-job directory on a local scratch disk and set $TMPDIR to that path. When the job completed, the per-job directory was automatically removed. This easily handled the first two items above (and any other software that was designed to reference $TMPDIR for temporary file storage).
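The per-job $TMPDIR behavior described above can be sketched in Python. This is only an illustration, not the plugin's code: the function name `run_with_job_tmpdir` and its parameters are hypothetical, and the scratch location defaults to whatever Python considers the system temporary directory.

```python
import os
import shutil
import tempfile


def run_with_job_tmpdir(job_id, work, scratch_root=None):
    """Create a per-job directory, point TMPDIR at it, run work(), then remove it."""
    scratch_root = scratch_root or tempfile.gettempdir()
    job_dir = os.path.join(scratch_root, f"job-{job_id}")
    os.makedirs(job_dir)

    saved_env = os.environ.get("TMPDIR")
    saved_cache = tempfile.tempdir
    os.environ["TMPDIR"] = job_dir
    tempfile.tempdir = None  # make the tempfile module re-read TMPDIR
    try:
        return work()
    finally:
        # Restore the caller's environment and remove everything the job left behind.
        if saved_env is None:
            os.environ.pop("TMPDIR", None)
        else:
            os.environ["TMPDIR"] = saved_env
        tempfile.tempdir = saved_cache
        shutil.rmtree(job_dir, ignore_errors=True)
```

Any code that honors $TMPDIR (here, Python's own `tempfile` module) is cleaned up automatically; code that hard-codes /tmp or /dev/shm is not, which is the gap the rest of this page addresses.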
The latter two items have remained an issue, though, since it is up to the software (or even the user's job script) to remove the temporary files when the job completes. Crashes and early termination of Open MPI are particularly troublesome because they leave behind many files in /dev/shm, which is stored not on disk but in physical RAM and swap, so orphaned files diminish the amount of RAM available on the node over time. It was necessary for us to deploy on Farber and Caviness a periodic cleanup program that identifies which files in /dev/shm are in use and deletes the rest.
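One way a cleanup program of this kind could decide which files are in use is to consult /proc on Linux. The sketch below is an assumption about the general approach, not the program actually deployed on Farber and Caviness; the function names are hypothetical, and it only lists candidates rather than deleting anything.

```python
import os


def paths_in_use(directory, proc_root="/proc"):
    """Return paths under `directory` that some visible process has open or mapped."""
    prefix = os.path.join(os.path.realpath(directory), "")
    busy = set()
    for pid in filter(str.isdigit, os.listdir(proc_root)):
        proc_dir = os.path.join(proc_root, pid)
        # Open file descriptors appear as symlinks under /proc/<pid>/fd.
        try:
            for fd in os.listdir(os.path.join(proc_dir, "fd")):
                try:
                    busy.add(os.readlink(os.path.join(proc_dir, "fd", fd)))
                except OSError:
                    pass  # descriptor was closed while we were looking
        except OSError:
            pass  # process exited, or we lack permission to inspect it
        # Memory-mapped files are listed in the last column of /proc/<pid>/maps.
        try:
            with open(os.path.join(proc_dir, "maps"), errors="replace") as maps:
                for line in maps:
                    fields = line.rstrip("\n").split(None, 5)
                    if len(fields) == 6:
                        busy.add(fields[5])
        except OSError:
            pass
    return {path for path in busy if path.startswith(prefix)}


def stale_files(directory):
    """List files under `directory` that no visible process is using."""
    busy = paths_in_use(directory)
    stale = []
    for root, _dirs, files in os.walk(os.path.realpath(directory)):
        for name in files:
            path = os.path.join(root, name)
            if path not in busy:
                stale.append(path)
    return sorted(stale)
```

A scan like this only sees every process when run as root, and a file can become busy between the scan and the deletion, which is part of why containing the files per job is preferable to cleaning up after the fact.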
The Linux virtual file system (VFS) is a hierarchy of directories and files, referenced by path. That namespace is a single tree, unlike Windows, where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.). In Linux, the directories and files on each storage component are instead mounted at some position within the single VFS tree. On Caviness, IT-maintained software can be found under /opt/shared on every node, yet those directories and files are not stored on disks present in that node; rather, they are stored on a file server whose storage is mounted at /opt/shared on the nodes. Interacting with the /opt directory requires Linux to access a local disk in the node, but moving to /opt/shared requires interaction with a file server across the network, all of which is transparent to the user who is accessing the file system.
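The way a path is resolved to a storage component can be illustrated with a longest-prefix lookup over the mount points. This is a simplified model of the idea, not how the kernel implements it; the function name and the storage labels in the usage example are hypothetical.

```python
import posixpath


def owning_mount(path, mount_points):
    """Return the mount point whose subtree contains `path` (longest matching prefix)."""
    path = posixpath.normpath(path)
    best = "/"
    for mount_point in mount_points:
        mount_point = posixpath.normpath(mount_point)
        inside = path == mount_point or path.startswith(mount_point.rstrip("/") + "/")
        if inside and len(mount_point) > len(best):
            best = mount_point
    return best


# Hypothetical mount table for a node: a local disk at / and a file server at /opt/shared.
mounts = {"/": "local disk", "/opt/shared": "file server"}
for p in ("/opt", "/opt/shared/some-package"):
    print(p, "->", mounts[owning_mount(p, mounts)])
```

With that mount table, /opt resolves to the local disk while anything under /opt/shared resolves to the file server, matching the Caviness example above.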
A bind mount in Linux takes an existing directory in the VFS tree and mounts it at another path. This seems like a solution to our problem above: if the per-job temporary directory (e.g. /tmp/job-8451) were bind-mounted at /tmp, then programs that use /tmp rather than $TMPDIR would not leave files in /tmp when they crash; the files would be in /tmp/job-8451. By the same token, bind-mounting /dev/shm/job-8451 at /dev/shm would contain the shared memory segments created by PSM2 and VADER in Open MPI jobs.
Unfortunately, once /tmp/job-8451 is bind-mounted as /tmp, every program on the node (including Slurm itself) will store its temporary files in /tmp/job-8451. If the node is shared by multiple jobs this would be a major problem, since each subsequent job that starts will modify what's mounted at /tmp:
Position | Job | What's actually mounted |
---|---|---|
1 | 8451 | /tmp/job-8451 |
2 | 8456 | /tmp/job-8451/job-8456 |
3 | 8460 | /tmp/job-8451/job-8456/job-8460 |
Once job 8456 has altered what's mounted at /tmp, job 8451 will no longer see the temporary files it created in /tmp/job-8451 and the program will likely crash. The same problem would exist for Slurm and OS programs that were using files in /tmp before job 8451 started.
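The stacking shown in the table can be reproduced with a few lines of Python. This is a pure simulation of path resolution under the assumption that every job bind-mounts its own /tmp/job-«job-id» directory at /tmp in the node's single shared VFS tree; nothing is actually mounted, and the function name is hypothetical.

```python
def stacked_bind_mounts(job_ids, mountpoint="/tmp"):
    """Simulate successive jobs bind-mounting <mountpoint>/job-<id> at <mountpoint>."""
    backing = mountpoint  # the real directory currently visible at `mountpoint`
    history = []
    for job_id in job_ids:
        # The path <mountpoint>/job-<id> is resolved through whatever is mounted
        # there right now, so each new job directory nests inside the previous one.
        backing = f"{backing}/job-{job_id}"
        history.append((job_id, backing))
    return history


for job_id, actual in stacked_bind_mounts([8451, 8456, 8460]):
    print(job_id, actual)
```

Each iteration replaces what every process on the node sees at /tmp, which is why earlier jobs lose sight of their files.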
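For the bind-mount solution to work, each Slurm job needs to have its own VFS tree that is independent of other programs on the node. Linux mount namespaces provide exactly that: a process can be placed in its own mount namespace, and mounts made inside that namespace are not visible to processes outside it.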
For Slurm jobs this equates to:
1. Create the per-job directory (/tmp/job-8451)
2. Bind-mount /tmp/job-8451 as /tmp (/tmp/job-8451 no longer visible to this program)
3. Run the job
4. Unmount /tmp (/tmp/job-8451 is again visible to this program)
5. Remove /tmp/job-8451
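The procedure above can be sketched in Python by calling the Linux `unshare(2)` and `mount(2)` system calls through `ctypes`. This is a minimal illustration under several assumptions, not the plugin's actual code: the function names are hypothetical, it must run as root on Linux, and it relies on the mount namespace being discarded when the child exits instead of unmounting explicitly.

```python
import ctypes
import os
import shutil
import subprocess

CLONE_NEWNS = 0x00020000  # unshare(2): give the caller its own mount namespace
MS_BIND = 0x1000          # mount(2): bind an existing directory at another path
MS_REC = 0x4000           # mount(2): apply to the whole subtree
MS_PRIVATE = 0x40000      # mount(2): keep mount changes from propagating out


def _enter_private_mounts(job_dir, mountpoint):
    """Runs in the child just before exec: new namespace, then the bind mount."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_ulong, ctypes.c_void_p]

    def check(ret, what):
        if ret != 0:
            err = ctypes.get_errno()
            raise OSError(err, f"{what}: {os.strerror(err)}")

    check(libc.unshare(CLONE_NEWNS), "unshare")
    # Mark every mount private so the bind mount below stays inside this namespace.
    check(libc.mount(b"none", b"/", None, MS_REC | MS_PRIVATE, None), "make / private")
    check(libc.mount(os.fsencode(job_dir), os.fsencode(mountpoint), None, MS_BIND, None),
          "bind mount")


def run_job(job_id, argv, mountpoint="/tmp"):
    """Run argv with <mountpoint>/job-<id> bind-mounted at <mountpoint> for that process only."""
    job_dir = os.path.join(mountpoint, f"job-{job_id}")
    os.mkdir(job_dir, 0o700)
    try:
        return subprocess.call(
            argv, preexec_fn=lambda: _enter_private_mounts(job_dir, mountpoint))
    finally:
        # The parent never left the original namespace, so job_dir is still visible here.
        shutil.rmtree(job_dir, ignore_errors=True)
```

For example, `run_job(8451, ["ls", "-la", "/tmp"])` run as root would list an empty directory, while other processes on the node continue to see the ordinary /tmp.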
With the same procedure applied to /dev/shm, the major sources of orphaned temporary files would be contained.
The original auto_tmpdir Slurm plugin has been rewritten (as of March 12, 2020) to no longer set $TMPDIR to the per-job directory it creates. Instead, it creates the following paths and bind-mounts them:
Directory created | Bind mountpoint |
---|---|
/tmp/job-«job-id» | |
/tmp/job-«job-id»/tmp | /tmp |
/tmp/job-«job-id»/var_tmp | /var/tmp |
/dev/shm/job-«job-id» | /dev/shm |
In some cases the user may want the /tmp directory for the job to be shared by all nodes participating in the job, e.g. somewhere on /lustre/scratch. The auto_tmpdir plugin implements a --use-shared-tmpdir flag to the salloc/srun/sbatch commands to request this:
Directory created | Bind mountpoint |
---|---|
/lustre/scratch/slurm/job-«job-id» | |
/lustre/scratch/slurm/job-«job-id»/tmp | /tmp |
/lustre/scratch/slurm/job-«job-id»/var_tmp | /var/tmp |
/dev/shm/job-«job-id» | /dev/shm |
A variant on the shared temporary directory scheme is to have each node use its own separate subdirectory (--use-shared-tmpdir=per-node):
Directory created | Bind mountpoint |
---|---|
/lustre/scratch/slurm/job-«job-id» | |
/lustre/scratch/slurm/job-«job-id»/«hostname» | |
/lustre/scratch/slurm/job-«job-id»/«hostname»/tmp | /tmp |
/lustre/scratch/slurm/job-«job-id»/«hostname»/var_tmp | /var/tmp |
/dev/shm/job-«job-id» | /dev/shm |
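The three layouts in the tables above differ only in the base directory. The following Python sketch computes the rows for a given job; it is a restatement of the tables, not code taken from the plugin, and the function name, parameter names, and the hostname used in the usage example are hypothetical.

```python
def auto_tmpdir_layout(job_id, shared_root=None, per_node=False, hostname=None):
    """Return (directory created, bind mountpoint or None) rows for one job on one node."""
    rows = []
    if shared_root is None:
        # Default: everything lives under the node-local /tmp.
        base = f"/tmp/job-{job_id}"
        rows.append((base, None))
    else:
        # Shared: e.g. shared_root="/lustre/scratch/slurm".
        base = f"{shared_root}/job-{job_id}"
        rows.append((base, None))
        if per_node:
            if not hostname:
                raise ValueError("per_node layout needs a hostname")
            base = f"{base}/{hostname}"
            rows.append((base, None))
    rows.append((f"{base}/tmp", "/tmp"))
    rows.append((f"{base}/var_tmp", "/var/tmp"))
    # /dev/shm is always node-local, whichever variant is chosen.
    rows.append((f"/dev/shm/job-{job_id}", "/dev/shm"))
    return rows


for directory, mountpoint in auto_tmpdir_layout(
        8451, shared_root="/lustre/scratch/slurm", per_node=True, hostname="node01"):
    print(directory, "|", mountpoint or "")
```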
When using the --use-shared-tmpdir flag, the plugin can also be asked to not remove the directories when the job exits by including the --no-rm-tmpdir flag.
The --no-rm-tmpdir flag should be used very cautiously, since leaving files behind on /lustre/scratch will consume capacity on that file system. A viable usage scenario would be debugging a job script that copies files to local scratch, runs a job, then copies results back to other storage. Once that behavior is debugged and goes into production, the user would stop using the --no-rm-tmpdir and --use-shared-tmpdir flags.
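As a usage illustration, the small Python helper below assembles an sbatch command line with the flags described above. The helper and the script name `job.qs` are hypothetical; only the flag spellings come from this page, and the command is built but not executed.

```python
def sbatch_command(script, shared_tmpdir=False, per_node=False, keep_tmpdir=False):
    """Build an sbatch argument list using the auto_tmpdir flags described above."""
    if (per_node or keep_tmpdir) and not shared_tmpdir:
        # A choice made by this sketch: both options are only described here
        # in combination with --use-shared-tmpdir.
        raise ValueError("per_node and keep_tmpdir require shared_tmpdir=True")
    argv = ["sbatch"]
    if shared_tmpdir:
        argv.append("--use-shared-tmpdir=per-node" if per_node else "--use-shared-tmpdir")
    if keep_tmpdir:
        argv.append("--no-rm-tmpdir")
    argv.append(script)
    return argv


# Debugging run: shared /tmp on Lustre, kept after the job for inspection.
print(" ".join(sbatch_command("job.qs", shared_tmpdir=True, keep_tmpdir=True)))
# Production run: back to the default node-local, automatically removed /tmp.
print(" ".join(sbatch_command("job.qs")))
```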
The source code for the auto_tmpdir plugin is publicly available on GitHub.