====== The auto_tmpdir Plugin for Per-job Temporary Directories ======

Many jobs that run on a cluster create //temporary files// of some kind:

  * Open MPI stores information regarding the topology and configuration of the worker processes in ''$TMPDIR''
  * VALET stores environment snapshots (on each ''vpkg_require'') in ''$TMPDIR''
  * The PSM2 library (for accelerated MPI over Intel OPA) and VADER plugin create shared memory segments in ''/dev/shm''
  * Some programs are hard-coded to use the ''/tmp'' directory directly (not ''$TMPDIR'')

Historically on Caviness the **auto_tmpdir** plugin created a per-job directory on a local scratch disk and set ''$TMPDIR'' to that path. When the job completed, the per-job directory was automatically removed. This easily handled the first two items above (and any other software that was designed to reference ''$TMPDIR'' for temporary file storage).

The latter two items have remained an issue, though, since it is up to the software (or even the user's job script) to remove the temporary files when the job completes. Crashes and early termination of Open MPI are particularly troublesome because they leave behind many files in ''/dev/shm'', which is backed by physical RAM and swap rather than disk, so orphaned files reduce the amount of RAM available on the node over time. It was necessary for us to deploy on Farber and Caviness a periodic cleanup program that identifies which files in ''/dev/shm'' are in use and deletes the rest.

==== Bind mounts ====

The Linux virtual file system (VFS) is a hierarchy of directories and files, referenced by //path//. That //namespace// is a single tree, unlike Windows where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.). In Linux, though, the directories and files on each storage component are //mounted// at some position within the single VFS tree.
On Caviness, IT-maintained software can be found under ''/opt/shared'' on every node, yet those directories and files are not stored on disks present in that node; rather, they are stored on a file server whose storage is //mounted// at ''/opt/shared'' on the nodes. Interacting with the ''/opt'' directory requires Linux to access a local disk in the node, but moving to ''/opt/shared'' requires interaction with a file server across the network, all transparent to the user who is accessing the file system.

A //bind mount// in Linux takes an existing directory in the VFS tree and mounts it at another path. This seems like a solution to our problem above: if the per-job temporary directory (e.g. ''/tmp/job-8451'') were bind-mounted at ''/tmp'', then programs that use ''/tmp'' rather than ''$TMPDIR'' would not leave files in ''/tmp'' when they crash; the files would be in ''/tmp/job-8451''. By the same token, bind-mounting ''/dev/shm/job-8451'' at ''/dev/shm'' would contain the shared memory segments created by PSM2 and VADER in Open MPI jobs.

Unfortunately, once ''/tmp/job-8451'' is bind-mounted as ''/tmp'', every program on the node (including Slurm itself) will store its temporary files in ''/tmp/job-8451''. If the node is shared by multiple jobs this would be a major problem, since each subsequent job that starts will modify what's mounted at ''/tmp'':

^Start order^Job^What's actually mounted at ''/tmp''^
|1|8451|''/tmp/job-8451''|
|2|8456|''/tmp/job-8451/job-8456''|
|3|8460|''/tmp/job-8451/job-8456/job-8460''|

Once job 8456 has altered what's mounted at ''/tmp'', job 8451 will no longer see the temporary files it created in ''/tmp/job-8451'' and the program will likely crash. The same problem would exist for Slurm and OS programs that were using files in ''/tmp'' before job 8451 started.

==== Namespaces to the rescue ====

For the bind-mount solution to work, each Slurm job needs to have its own VFS tree that is independent of other programs on the node.
Linux //mount namespaces// are exactly that:

  * every program that executes starts with its parent's VFS tree
  * if the program has appropriate privileges, it can clone that initial VFS tree
  * storage components subsequently mounted/unmounted by the program only affect its own VFS tree

For Slurm jobs this equates to:

  - When the job starts, the plugin creates a per-job temporary directory (''/tmp/job-8451'')
  - The plugin clones the VFS tree (now has a private mount namespace)
  - The plugin bind-mounts ''/tmp/job-8451'' as ''/tmp'' (''/tmp/job-8451'' no longer visible to this program)
  - When the job ends, the plugin unmounts ''/tmp'' (''/tmp/job-8451'' is again visible to this program)
  - The plugin removes ''/tmp/job-8451''

With the same procedure applied to ''/dev/shm'', the major sources of orphaned temporary files would be contained.

===== The auto_tmpdir plugin =====

The original **auto_tmpdir** Slurm plugin has been rewritten (as of March 12, 2020) to no longer set ''$TMPDIR'' to the per-job directory it creates. Instead, it creates the following paths and bind-mounts them:

^Directory created^Bind mountpoint^
|''/tmp/job-«job-id»''| |
|''/tmp/job-«job-id»/tmp''|''/tmp''|
|''/tmp/job-«job-id»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|

==== Shared tmpdir ====

In some cases the user may want the ''/tmp'' directory for the job to be shared by all nodes participating in the job, e.g. somewhere on ''/lustre/scratch''.
The **auto_tmpdir** plugin adds a ''--use-shared-tmpdir'' flag to the **salloc/srun/sbatch** commands to request this:

^Directory created^Bind mountpoint^
|''/lustre/scratch/slurm/job-«job-id»''| |
|''/lustre/scratch/slurm/job-«job-id»/tmp''|''/tmp''|
|''/lustre/scratch/slurm/job-«job-id»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|

A variant on the shared temporary directory scheme is to have each node use its own separate subdirectory (''--use-shared-tmpdir=per-node''):

^Directory created^Bind mountpoint^
|''/lustre/scratch/slurm/job-«job-id»''| |
|''/lustre/scratch/slurm/job-«job-id»/«hostname»''| |
|''/lustre/scratch/slurm/job-«job-id»/«hostname»/tmp''|''/tmp''|
|''/lustre/scratch/slurm/job-«job-id»/«hostname»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|

When the ''--use-shared-tmpdir'' flag is used, the plugin can also be asked to //not// remove the directories when the job exits by including the ''--no-rm-tmpdir'' flag. The ''--no-rm-tmpdir'' flag should be used very cautiously, since leaving files behind on ''/lustre/scratch'' will consume capacity on that file system. A viable usage scenario would be debugging a job script that copies files to local scratch, runs a job, then copies results back to other storage. Once that behavior is debugged and goes into production, the user would stop using the ''--no-rm-tmpdir'' and ''--use-shared-tmpdir'' flags.

===== Source code =====

The source code for the **auto_tmpdir** plugin is publicly available on [[https://github.com/jtfrey/auto_tmpdir/|GitHub]].
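
Independent of the plugin's actual source, the three directory layouts tabulated in the sections above can be summarized in a short Python sketch. It only computes the list of directories and bind mountpoints; it does not create directories or perform any mounts. The function name ''plan_job_dirs'', its parameters, and the hostname ''r00n00'' used in the demo are hypothetical and chosen purely for illustration; the path prefixes are the ones shown in the tables.

```python
from typing import List, Optional, Tuple

# Path prefixes taken from the tables in this document.
LOCAL_PREFIX = "/tmp"
SHARED_PREFIX = "/lustre/scratch/slurm"
SHM_PREFIX = "/dev/shm"

# Each entry is (directory created, bind mountpoint or None).
Plan = List[Tuple[str, Optional[str]]]


def plan_job_dirs(job_id: int,
                  shared: bool = False,
                  per_node: bool = False,
                  hostname: Optional[str] = None) -> Plan:
    """Return the directories for a job, in creation order.

    Hypothetical helper: it mirrors the documented layouts, not the
    plugin's real implementation.
    """
    if per_node and not shared:
        raise ValueError("per_node only applies to a shared tmpdir")
    if per_node and not hostname:
        raise ValueError("per_node requires a hostname")

    prefix = SHARED_PREFIX if shared else LOCAL_PREFIX
    base = f"{prefix}/job-{job_id}"
    plan: Plan = [(base, None)]          # top-level per-job directory

    if per_node:
        base = f"{base}/{hostname}"      # one subdirectory per node
        plan.append((base, None))

    plan.append((f"{base}/tmp", "/tmp"))
    plan.append((f"{base}/var_tmp", "/var/tmp"))
    # Shared memory always stays local to the node.
    plan.append((f"{SHM_PREFIX}/job-{job_id}", "/dev/shm"))
    return plan


if __name__ == "__main__":
    for directory, mountpoint in plan_job_dirs(8451, shared=True,
                                               per_node=True,
                                               hostname="r00n00"):
        print(directory, "->", mountpoint or "(not mounted)")
```

Keeping the layout as plain data makes the cleanup order easy to reason about: unmounting and removal would walk the same list in reverse.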