technical:slurm:darwin:auto_tmpdir

Differences

technical:slurm:darwin:auto_tmpdir [2020-03-12 11:44] – external edit 127.0.0.1
technical:slurm:darwin:auto_tmpdir [2021-01-06 12:37] (current) frey
Line 8: Line 8:
   * Some programs are hard-coded to use the ''/tmp'' directory directly (not ''$TMPDIR'')
  
-Historically on Caviness the **auto_tmpdir** plugin created a per-job directory on a local scratch disk and set ''$TMPDIR'' to that path.  When the job completed, the per-job directory was automatically removed.  This easily handled the first two items above (and any other software that was designed to reference ''$TMPDIR'' for temporary file storage).
-
-The latter two items have remained an issue, though, since it is up to the software (or even the user's job script) to remove the temporary files when the job completes.  Crashes and early termination of Open MPI are particularly annoying because they leave behind many files in ''/dev/shm'', which is not stored on disk but in physical RAM and swap (which diminishes the amount of available RAM on the node over time).  It was necessary for us to deploy on Farber and Caviness a periodic cleanup program that identifies which files in ''/dev/shm'' are in-use and delete the rest.
+It is helpful to have Slurm automatically manage the lifetime of temporary files associated with jobs running on the cluster.
  
 ==== Bind mounts ====
  
-The Linux virtual file system (VFS) is a hierarchy of directories and files, referenced by //path//.  That //namespace// is a single tree, unlike Windows where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.).  In Linux, though, the directories and files on each storage component are //mounted// at some position within the single VFS tree.  On Caviness, IT-maintained software can be found under ''/opt/shared'' on every node, yet those directories and files are not stored in disks present in that node; rather, they are stored on a file server whose storage is //mounted// at ''/opt/shared'' on the nodes.  Interacting with the ''/opt'' directory requires Linux to access a local disk in the node, but moving to ''/opt/shared'' requires interaction with a file server across the network, all transparent to the user who is accessing the file system.
+The Linux virtual file system (VFS) is a hierarchy of directories and files, referenced by //path//.  That //namespace// is a single tree, unlike Windows where each distinct storage component is assigned a letter (your primary hard drive is your "C" drive, etc.).  In Linux, though, the directories and files on each storage component are //mounted// at some position within the single VFS tree.  On DARWIN, IT-maintained software can be found under ''/opt/shared'' on every node, yet those directories and files are not stored in disks present in that node; rather, they are stored on a file server whose storage is //mounted// at ''/opt/shared'' on the nodes.  Interacting with the ''/opt'' directory requires Linux to access a local disk in the node, but moving to ''/opt/shared'' requires interaction with a file server across the network, all transparent to the user who is accessing the file system.
  
 A //bind mount// in Linux takes an existing directory in the VFS tree and mounts it at another path.  This seems like a solution to our problem above:  if the per-job temporary directory (e.g. ''/tmp/job-8451'') were bind-mounted at ''/tmp'', then programs that use ''/tmp'' rather than ''$TMPDIR'' would not leave files in ''/tmp'' when they crash; the files would be in ''/tmp/job-8451''.  By the same token, bind-mounting ''/dev/shm/job-8451'' at ''/dev/shm'' will contain the shared memory segments created by PSM2 and VADER in Open MPI jobs.
Line 41: Line 39:
   - The plugin bind-mounts ''/tmp/job-8451'' as ''/tmp'' (''/tmp/job-8451'' no longer visible to this program)
   - When the job ends, the plugin unmounts ''/tmp'' (''/tmp/job-8451'' is again visible to this program)
-  - The plugin removed ''/tmp/job-8451''
+  - The plugin removes ''/tmp/job-8451''
  
 With the same procedure applied to ''/dev/shm'', the major sources of orphaned temporary files would be contained.
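
The create, bind-mount, unmount, remove sequence described above can be sketched as the list of commands a privileged helper would run. This is an illustration only: the function name ''lifecycle_commands'' is hypothetical, it says nothing about how the plugin itself is implemented, and the commands are returned rather than executed because ''mount --bind'' requires root.

```python
def lifecycle_commands(job_id, mountpoints=("/tmp", "/dev/shm")):
    """Return (setup, teardown) command lists for per-job bind mounts.

    Hypothetical helper for illustration; nothing is executed here.
    """
    setup, teardown = [], []
    for mountpoint in mountpoints:
        # Per-job directory lives inside the directory it will later cover.
        job_dir = f"{mountpoint}/job-{job_id}"
        setup.append(["mkdir", "-p", job_dir])
        setup.append(["mount", "--bind", job_dir, mountpoint])
        # After the job: unmount first so the per-job directory is
        # reachable by its original path again, then remove it.
        teardown.append(["umount", mountpoint])
        teardown.append(["rm", "-rf", job_dir])
    return setup, teardown


setup, teardown = lifecycle_commands(8451)
for argv in setup + teardown:
    print(" ".join(argv))
```

Returning the commands as data keeps the ordering constraint (unmount before remove) easy to inspect without touching the real ''/tmp'' or ''/dev/shm''.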
Line 47: Line 45:
 ===== The auto_tmpdir plugin =====
  
-The original **auto_tmpdir** Slurm plugin has been rewritten (as of March 12, 2020) to no longer set ''$TMPDIR'' to the per-job directory it creates.  Instead, it creates the following paths and bind-mounts them:
+The **auto_tmpdir** Slurm plugin creates the following paths and bind-mounts them:
  
 ^Directory created^Bind mountpoint^
Line 57: Line 55:
 ==== Shared tmpdir ====
  
-In some cases the user may want the ''/tmp'' directory for the job to be shared by all nodes participating on the job (e.g. somewhere on ''/lustre/scratch'').  The **auto_tmpdir** plugin implements a ''--use-shared-tmpdir'' flag to the **salloc/srun/sbatch** commands to request this:
+In some cases the user may want the ''/tmp'' directory for the job to be shared by all nodes participating on the job (e.g. somewhere on ''/lustre'').  The **auto_tmpdir** plugin implements a ''--use-shared-tmpdir'' flag to the **salloc/srun/sbatch** commands to request this:
  
 ^Directory created^Bind mountpoint^
-|''/lustre/scratch/slurm/job-«job-id»''| |
-|''/lustre/scratch/slurm/job-«job-id»/tmp''|''/tmp''|
-|''/lustre/scratch/slurm/job-«job-id»/var_tmp''|''/var/tmp''|
+|''/lustre/slurm/job-«job-id»''| |
+|''/lustre/slurm/job-«job-id»/tmp''|''/tmp''|
+|''/lustre/slurm/job-«job-id»/var_tmp''|''/var/tmp''|
 |''/dev/shm/job-«job-id»''|''/dev/shm''|
  
Line 68: Line 66:
  
 ^Directory created^Bind mountpoint^
-|''/lustre/scratch/slurm/job-«job-id»''| |
-|''/lustre/scratch/slurm/job-«job-id»/«hostname»''| |
-|''/lustre/scratch/slurm/job-«job-id»/«hostname»/tmp''|''/tmp''|
-|''/lustre/scratch/slurm/job-«job-id»/«hostname»/var_tmp''|''/var/tmp''|
+|''/lustre/slurm/job-«job-id»''| |
+|''/lustre/slurm/job-«job-id»/«hostname»''| |
+|''/lustre/slurm/job-«job-id»/«hostname»/tmp''|''/tmp''|
+|''/lustre/slurm/job-«job-id»/«hostname»/var_tmp''|''/var/tmp''|
 |''/dev/shm/job-«job-id»''|''/dev/shm''|
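
The two shared-tmpdir tables differ only in the optional «hostname» level, so both layouts can be generated from one small function. The sketch below is illustrative and not part of the plugin: ''tmpdir_layout'' and the host name ''node001'' are made up, and the default base directory assumes the ''/lustre/slurm'' location from the newer revision (the older revision used ''/lustre/scratch/slurm'').

```python
def tmpdir_layout(job_id, base="/lustre/slurm", hostname=None):
    """Return (directory, mountpoint) pairs for a shared-tmpdir job.

    mountpoint is None for directories that are created but not
    bind-mounted. Hypothetical helper mirroring the tables above.
    """
    job_dir = f"{base}/job-{job_id}"
    layout = [(job_dir, None)]
    parent = job_dir
    if hostname is not None:
        # Per-host variant: one subdirectory per participating node.
        parent = f"{job_dir}/{hostname}"
        layout.append((parent, None))
    layout.append((f"{parent}/tmp", "/tmp"))
    layout.append((f"{parent}/var_tmp", "/var/tmp"))
    # /dev/shm is memory-backed, so it stays node-local in both variants.
    layout.append((f"/dev/shm/job-{job_id}", "/dev/shm"))
    return layout


for directory, mountpoint in tmpdir_layout(8451, hostname="node001"):
    print(directory, "->", mountpoint or "(not mounted)")
```

Passing ''base="/lustre/scratch/slurm"'' reproduces the paths from the older revision.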
  
  • technical/slurm/darwin/auto_tmpdir.1584027861.txt.gz
  • Last modified: 2020-03-12 11:44
  • by 127.0.0.1