====== The auto_tmpdir Plugin for Per-job Temporary Directories ======

Many jobs that run on a cluster create //temporary files// of some kind:

  * Open MPI stores information regarding the topology and configuration of the worker processes in ''$TMPDIR''
  * VALET stores environment snapshots (on each ''vpkg_require'') in ''$TMPDIR''
  * The PSM2 library (for accelerated MPI over Intel OPA) and the Open MPI VADER plugin create shared memory segments in ''/dev/shm''
  * Some programs are hard-coded to use the ''/tmp'' directory directly (not ''$TMPDIR'')

It is helpful to have Slurm automatically manage the lifetime of temporary files associated with jobs running on the cluster.

==== Bind mounts ====

The Linux virtual file system (VFS) is a hierarchy of directories and files, referenced by //path//. That //namespace// is a single tree, unlike Windows where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.). In Linux, though, the directories and files on each storage component are //mounted// at some position within the single VFS tree.

On DARWIN, IT-maintained software can be found under ''/opt/shared'' on every node, yet those directories and files are not stored on disks present in that node; rather, they are stored on a file server whose storage is //mounted// at ''/opt/shared'' on the nodes. Interacting with the ''/opt'' directory requires Linux to access a local disk in the node, but moving to ''/opt/shared'' requires interaction with a file server across the network, all transparent to the user who is accessing the file system.

A //bind mount// in Linux takes an existing directory in the VFS tree and mounts it at another path. This seems like a solution to our problem above: if the per-job temporary directory (e.g. ''/tmp/job-8451'') were bind-mounted at ''/tmp'', then programs that use ''/tmp'' rather than ''$TMPDIR'' would not leave files in ''/tmp'' when they crash; the files would be in ''/tmp/job-8451''.
By the same token, bind-mounting ''/dev/shm/job-8451'' at ''/dev/shm'' will contain the shared memory segments created by PSM2 and VADER in Open MPI jobs.

Unfortunately, once ''/tmp/job-8451'' is bind-mounted as ''/tmp'', every program on the node (including Slurm itself) will store its temporary files in ''/tmp/job-8451''. If the node is shared by multiple jobs this would be a major problem, since each subsequent job that starts will modify what's mounted at ''/tmp'':

^Position^Job^What's actually mounted^
|1|8451|''/tmp/job-8451''|
|2|8456|''/tmp/job-8451/job-8456''|
|3|8460|''/tmp/job-8451/job-8456/job-8460''|

Once job 8456 has altered what's mounted at ''/tmp'', job 8451 will no longer see the temporary files it created in ''/tmp/job-8451'' and the program will likely crash. The same problem would exist for Slurm and OS programs that were using files in ''/tmp'' before job 8451 started.

==== Namespaces to the rescue ====

For the bind-mount solution to work, each Slurm job needs to have its own VFS tree that is independent of other programs on the node. Linux //mount namespaces// are exactly that:

  * every program that executes starts with its parent's VFS tree
  * if the program has appropriate privileges, it can clone that initial VFS tree
  * storage components subsequently mounted/unmounted by the program only affect its own VFS tree

For Slurm jobs this equates to:

  - When the job starts, the plugin creates a per-job temporary directory (''/tmp/job-8451'')
  - The plugin clones the VFS tree (now has a private mount namespace)
  - The plugin bind-mounts ''/tmp/job-8451'' as ''/tmp'' (''/tmp/job-8451'' no longer visible to this program)
  - When the job ends, the plugin unmounts ''/tmp'' (''/tmp/job-8451'' is again visible to this program)
  - The plugin removes ''/tmp/job-8451''

With the same procedure applied to ''/dev/shm'', the major sources of orphaned temporary files would be contained.
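
The first three steps of that procedure can be sketched in a few lines of Python calling the Linux ''unshare'' and ''mount'' system calls through ''ctypes''. This is only an illustration of the mount-namespace technique, not the plugin's actual implementation; the function names ''bind_plan'' and ''enter_job_namespace'' are hypothetical, and the sketch assumes a Linux host and root privileges (''CAP_SYS_ADMIN''). Cleanup at job end (unmounting and removing the directories) is omitted.

```python
import ctypes
import ctypes.util
import os

# Constants from <sched.h> and <sys/mount.h> on Linux.
CLONE_NEWNS = 0x00020000   # give the caller its own mount namespace
MS_BIND = 0x1000           # bind mount
MS_REC = 0x4000            # apply to the whole subtree
MS_PRIVATE = 1 << 18       # do not propagate mount events


def bind_plan(job_id):
    """Hypothetical helper: (source, mountpoint) pairs for one job."""
    return [
        (f"/tmp/job-{job_id}", "/tmp"),
        (f"/dev/shm/job-{job_id}", "/dev/shm"),
    ]


def _libc():
    # Load the C library lazily and declare the two calls we need.
    libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)
    libc.unshare.argtypes = [ctypes.c_int]
    libc.mount.argtypes = [ctypes.c_char_p, ctypes.c_char_p,
                           ctypes.c_char_p, ctypes.c_ulong, ctypes.c_void_p]
    return libc


def _check(ret):
    # Both system calls return 0 on success and set errno on failure.
    if ret != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))


def enter_job_namespace(job_id):
    """Create the per-job directories and bind-mount them privately.

    Assumption: run as root, before the job's own program is started.
    """
    libc = _libc()
    plan = bind_plan(job_id)
    # Step 1: create the per-job directories in the shared VFS tree.
    for source, _ in plan:
        os.makedirs(source, mode=0o700, exist_ok=True)
    # Step 2: clone the VFS tree into a private mount namespace and
    # stop mount events from propagating back to the parent namespace.
    _check(libc.unshare(CLONE_NEWNS))
    _check(libc.mount(b"none", b"/", None, MS_REC | MS_PRIVATE, None))
    # Step 3: bind-mount each per-job directory over its public path.
    for source, mountpoint in plan:
        _check(libc.mount(source.encode(), mountpoint.encode(),
                          None, MS_BIND, None))
```

A wrapper would call ''enter_job_namespace(8451)'' and then start the job's program (for instance with ''os.execvp''); any process started afterwards inherits the private namespace, while other programs on the node continue to see the original ''/tmp'' and ''/dev/shm''.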
===== The auto_tmpdir plugin =====

The **auto_tmpdir** Slurm plugin creates the following paths and bind-mounts them:

^Directory created^Bind mountpoint^
|''/tmp/job-«job-id»''| |
|''/tmp/job-«job-id»/tmp''|''/tmp''|
|''/tmp/job-«job-id»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|

==== Shared tmpdir ====

In some cases the user may want the ''/tmp'' directory for the job to be shared by all nodes participating in the job (e.g. somewhere on ''/lustre''). The **auto_tmpdir** plugin implements a ''--use-shared-tmpdir'' flag to the **salloc/srun/sbatch** commands to request this:

^Directory created^Bind mountpoint^
|''/lustre/slurm/job-«job-id»''| |
|''/lustre/slurm/job-«job-id»/tmp''|''/tmp''|
|''/lustre/slurm/job-«job-id»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|

A variant on the shared temporary directory scheme is to have each node use its own separate subdirectory (''--use-shared-tmpdir=per-node''):

^Directory created^Bind mountpoint^
|''/lustre/slurm/job-«job-id»''| |
|''/lustre/slurm/job-«job-id»/«hostname»''| |
|''/lustre/slurm/job-«job-id»/«hostname»/tmp''|''/tmp''|
|''/lustre/slurm/job-«job-id»/«hostname»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|

When using the ''--use-shared-tmpdir'' flag, the plugin can also be asked //not// to remove the directories when the job exits by including the ''--no-rm-tmpdir'' flag.

The ''--no-rm-tmpdir'' flag should be used very cautiously, since leaving files behind on ''/lustre/scratch'' will consume capacity on that file system. A viable usage scenario would be debugging a job script that copies files to local scratch, runs a job, then copies results back to other storage. Once that behavior is debugged and goes into production, the user would stop using the ''--no-rm-tmpdir'' and ''--use-shared-tmpdir'' flags.

===== Source code =====

The source code for the **auto_tmpdir** plugin is publicly available on [[https://github.com/jtfrey/auto_tmpdir/|GitHub]].