technical:slurm:auto_tmpdir

Page created 2020-03-12 10:47 by frey; last revised 2020-03-12 11:44 by frey.
The latter two items have remained an issue, though, since it is up to the software (or even the user's job script) to remove temporary files when the job completes.  Crashes and early terminations of Open MPI are particularly annoying because they leave behind many files in ''/dev/shm'', which is stored not on disk but in physical RAM and swap (diminishing the amount of RAM available on the node over time).  We found it necessary to deploy on Farber and Caviness a periodic cleanup program that identifies which files in ''/dev/shm'' are in use and deletes the rest.
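The cleanup idea can be sketched in a few lines of Python (this is an illustration, not the actual program deployed on Farber and Caviness): enumerate every file currently held open by any process via ''/proc/«pid»/fd'', then delete files in the target directory that no process references.

```python
import os

def cleanup_unused(dirpath):
    """Delete regular files under dirpath that no running process holds open.

    Illustrative sketch only; the real cleanup program's name and logic
    are not published here.
    """
    # Collect the real paths of every open file descriptor on the node.
    in_use = set()
    for pid in os.listdir('/proc'):
        if not pid.isdigit():
            continue
        fd_dir = os.path.join('/proc', pid, 'fd')
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or fds unreadable without privilege
        for fd in fds:
            try:
                in_use.add(os.path.realpath(os.path.join(fd_dir, fd)))
            except OSError:
                pass  # fd closed between listdir() and realpath()
    # Remove anything in dirpath that nobody references.
    removed = []
    for name in sorted(os.listdir(dirpath)):
        path = os.path.join(dirpath, name)
        if os.path.isfile(path) and os.path.realpath(path) not in in_use:
            os.unlink(path)
            removed.append(name)
    return removed
```

Run as root against ''/dev/shm'', this reclaims the RAM held by orphaned shared-memory files; an unprivileged run can only inspect its own processes' file descriptors, which is why a deployment like this runs with elevated privilege.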
  
==== Bind mounts ====
  
The Linux virtual file system (VFS) is a hierarchy of directories and files referenced by //path//.  That //namespace// is a single tree, unlike Windows, where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.).  In Linux, the directories and files on each storage component are //mounted// at some position within the single VFS tree.  On Caviness, IT-maintained software can be found under ''/opt/shared'' on every node, yet those directories and files are not stored on disks present in that node; rather, they are stored on a file server whose storage is //mounted// at ''/opt/shared'' on the nodes.  Interacting with the ''/opt'' directory requires Linux to access a local disk in the node, but moving into ''/opt/shared'' requires interaction with a file server across the network, all of it transparent to the user accessing the file system.
Once job 8456 has altered what is mounted at ''/tmp'', job 8451 will no longer see the temporary files it created in ''/tmp/job-8451'', and the program will likely crash.  The same problem would exist for Slurm and OS programs that were using files in ''/tmp'' before job 8451 began executing.
  
==== Namespaces to the rescue ====
  
For the bind-mount solution to work, each Slurm job needs to have its own VFS tree that is independent of other programs on the node.  Linux //mount namespaces// are exactly that:
  
With the same procedure applied to ''/dev/shm'', the major sources of orphaned temporary files would be contained.

===== The auto_tmpdir plugin =====

The original **auto_tmpdir** Slurm plugin has been rewritten (as of March 12, 2020) so that it no longer sets ''$TMPDIR'' to the per-job directory it creates.  Instead, it creates the following paths and bind-mounts them:

^Directory created^Bind mountpoint^
|''/tmp/job-«job-id»''| |
|''/tmp/job-«job-id»/tmp''|''/tmp''|
|''/tmp/job-«job-id»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|

==== Shared tmpdir ====

In some cases the user may want the job's ''/tmp'' directory to be shared by all nodes participating in the job, for example somewhere under ''/lustre/scratch''.  The **auto_tmpdir** plugin implements a ''--use-shared-tmpdir'' flag for the **salloc/srun/sbatch** commands to request this:

^Directory created^Bind mountpoint^
|''/lustre/scratch/slurm/job-«job-id»''| |
|''/lustre/scratch/slurm/job-«job-id»/tmp''|''/tmp''|
|''/lustre/scratch/slurm/job-«job-id»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|

A variant of the shared temporary directory scheme has each node use its own separate subdirectory (''--use-shared-tmpdir=per-node''):

^Directory created^Bind mountpoint^
|''/lustre/scratch/slurm/job-«job-id»''| |
|''/lustre/scratch/slurm/job-«job-id»/«hostname»''| |
|''/lustre/scratch/slurm/job-«job-id»/«hostname»/tmp''|''/tmp''|
|''/lustre/scratch/slurm/job-«job-id»/«hostname»/var_tmp''|''/var/tmp''|
|''/dev/shm/job-«job-id»''|''/dev/shm''|
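
The three layouts above can be summarized as a small path-construction sketch in Python (the function name, its arguments, and the example hostname in the usage note are illustrative, not part of the plugin, which is written as a compiled Slurm plugin):

```python
import os

def tmpdir_layout(job_id, shared=False, per_node=False, hostname=None):
    """Return (directory created, bind mountpoint) pairs mirroring the
    tables above.  Illustrative sketch only, not auto_tmpdir source code.
    """
    if shared:
        base = '/lustre/scratch/slurm/job-%d' % job_id
        if per_node:
            base = os.path.join(base, hostname)  # one subdirectory per node
    else:
        base = '/tmp/job-%d' % job_id
    return [
        (base, None),                              # created, but not a mountpoint
        (os.path.join(base, 'tmp'), '/tmp'),
        (os.path.join(base, 'var_tmp'), '/var/tmp'),
        ('/dev/shm/job-%d' % job_id, '/dev/shm'),  # always node-local
    ]
```

For example, ''tmpdir_layout(8451, shared=True, per_node=True, hostname='n000')'' (a hypothetical job id and hostname) yields the per-node layout rooted at ''/lustre/scratch/slurm/job-8451/n000''.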

When the ''--use-shared-tmpdir'' flag is used, the plugin can also be asked //not// to remove the directories when the job exits by adding the ''--no-rm-tmpdir'' flag.

<WRAP center round important 60%>
The ''--no-rm-tmpdir'' flag should be used very cautiously, since files left behind on ''/lustre/scratch'' consume capacity on that file system.  A viable usage scenario is debugging a job script that copies files to local scratch, runs a computation, then copies results back to other storage.  Once that behavior is debugged and the script goes into production, the user would stop using the ''--no-rm-tmpdir'' and ''--use-shared-tmpdir'' flags.
</WRAP>

===== Source code =====

The source code for the **auto_tmpdir** plugin is publicly available on [[https://github.com/jtfrey/auto_tmpdir/|GitHub]].