technical:slurm:auto_tmpdir

Differences

This shows you the differences between two versions of the page.

technical:slurm:auto_tmpdir [2020-03-12 11:03] frey
technical:slurm:auto_tmpdir [2020-03-12 11:42] – [The auto_tmpdir plugin] frey
Line 47:
 ===== The auto_tmpdir plugin =====
  
 +The original **auto_tmpdir** Slurm plugin was rewritten (as of March 12, 2020) so that it no longer sets ''$TMPDIR'' to the per-job directory it creates. Instead, it creates the following directories and bind-mounts them over the standard temporary locations:
 +
 +^Directory created^Bind mountpoint^
 +|''/tmp/job-«job-id»''| |
 +|''/tmp/job-«job-id»/tmp''|''/tmp''|
 +|''/tmp/job-«job-id»/var_tmp''|''/var/tmp''|
 +|''/dev/shm/job-«job-id»''|''/dev/shm''|
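
The default layout in the table above can be sketched as a small path-computing helper. This is only an illustration of the naming scheme, not the plugin's own code: the function name ''default_layout'' and the example job id are hypothetical, and nothing here creates directories or performs mounts.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Tuple


def default_layout(job_id: int) -> List[Tuple[str, Optional[str]]]:
    """Return (directory created, bind mountpoint) pairs for the default mode.

    A mountpoint of None marks a directory that is created but not mounted.
    Hypothetical helper: it only mirrors the table, it touches no files.
    """
    job_root = PurePosixPath("/tmp") / f"job-{job_id}"
    shm_root = PurePosixPath("/dev/shm") / f"job-{job_id}"
    return [
        (str(job_root), None),                    # per-job container directory
        (str(job_root / "tmp"), "/tmp"),          # becomes the job's /tmp
        (str(job_root / "var_tmp"), "/var/tmp"),  # becomes the job's /var/tmp
        (str(shm_root), "/dev/shm"),              # becomes the job's /dev/shm
    ]


if __name__ == "__main__":
    # 12345 is an arbitrary example job id.
    for directory, mountpoint in default_layout(12345):
        print(f"{directory} -> {mountpoint or '(not mounted)'}")
```

Because the per-job directories are mounted over ''/tmp'', ''/var/tmp'', and ''/dev/shm'', a program that writes to those standard paths lands in the per-job directories without consulting ''$TMPDIR''.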
 +
 +In some cases the user may want the job's ''/tmp'' directory to be shared by all nodes participating in the job, e.g. somewhere on ''/lustre/scratch''. The **auto_tmpdir** plugin adds a ''--use-shared-tmpdir'' flag to the **salloc**, **srun**, and **sbatch** commands to request this:
 +
 +^Directory created^Bind mountpoint^
 +|''/lustre/scratch/slurm/job-«job-id»''| |
 +|''/lustre/scratch/slurm/job-«job-id»/tmp''|''/tmp''|
 +|''/lustre/scratch/slurm/job-«job-id»/var_tmp''|''/var/tmp''|
 +|''/dev/shm/job-«job-id»''|''/dev/shm''|
 +
 +A variant on the shared temporary directory scheme is to have each node use its own separate subdirectory (''--use-shared-tmpdir=per-node''):
 +
 +^Directory created^Bind mountpoint^
 +|''/lustre/scratch/slurm/job-«job-id»''| |
 +|''/lustre/scratch/slurm/job-«job-id»/«hostname»''| |
 +|''/lustre/scratch/slurm/job-«job-id»/«hostname»/tmp''|''/tmp''|
 +|''/lustre/scratch/slurm/job-«job-id»/«hostname»/var_tmp''|''/var/tmp''|
 +|''/dev/shm/job-«job-id»''|''/dev/shm''|
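
The two shared layouts differ only in the extra ''«hostname»'' level, which the following sketch makes explicit. The helper name ''shared_layout'' and the example hostname are hypothetical, and the base path ''/lustre/scratch/slurm'' is taken from the tables above; whether the plugin uses the short or fully qualified hostname is not stated here, so the hostname is simply passed in.

```python
from pathlib import PurePosixPath
from typing import List, Optional, Tuple

# Base directory as shown in the tables above.
SHARED_BASE = PurePosixPath("/lustre/scratch/slurm")


def shared_layout(job_id: int, hostname: Optional[str] = None) -> List[Tuple[str, Optional[str]]]:
    """Return (directory created, bind mountpoint) pairs for the shared modes.

    hostname=None models plain --use-shared-tmpdir; passing a hostname models
    --use-shared-tmpdir=per-node for that node. Hypothetical helper: it only
    computes paths and creates or mounts nothing.
    """
    job_root = SHARED_BASE / f"job-{job_id}"
    rows: List[Tuple[str, Optional[str]]] = [(str(job_root), None)]

    node_root = job_root
    if hostname is not None:
        # Per-node variant: each node works under its own subdirectory.
        node_root = job_root / hostname
        rows.append((str(node_root), None))

    rows.append((str(node_root / "tmp"), "/tmp"))
    rows.append((str(node_root / "var_tmp"), "/var/tmp"))
    # /dev/shm stays node-local in both shared modes.
    rows.append((f"/dev/shm/job-{job_id}", "/dev/shm"))
    return rows


if __name__ == "__main__":
    # 12345 and "node001" are arbitrary example values.
    for mode, host in (("shared", None), ("per-node", "node001")):
        print(mode)
        for directory, mountpoint in shared_layout(12345, host):
            print(f"  {directory} -> {mountpoint or '(not mounted)'}")
```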
 +
 +When using the ''--use-shared-tmpdir'' flag, the plugin can also be asked to //not// remove the directories when the job exits by including the ''--no-rm-tmpdir'' flag.
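
As a rough sketch of how the flags combine on a submission command line, the helper below assembles an **sbatch** argument list. The function name ''sbatch_command'' and the script name ''job.sh'' are hypothetical. Rejecting ''--no-rm-tmpdir'' without a shared mode is an assumption made here, since the text only describes that flag together with ''--use-shared-tmpdir''.

```python
import shlex
from typing import List, Optional


def sbatch_command(script: str, shared_tmpdir: Optional[str] = None, keep: bool = False) -> List[str]:
    """Build an sbatch argument list using the auto_tmpdir flags.

    shared_tmpdir: None for the node-local default, "shared" for
    --use-shared-tmpdir, or "per-node" for --use-shared-tmpdir=per-node.
    keep: add --no-rm-tmpdir so the directories survive the job.
    """
    cmd = ["sbatch"]
    if shared_tmpdir == "shared":
        cmd.append("--use-shared-tmpdir")
    elif shared_tmpdir == "per-node":
        cmd.append("--use-shared-tmpdir=per-node")
    elif shared_tmpdir is not None:
        raise ValueError(f"unknown shared_tmpdir mode: {shared_tmpdir!r}")

    if keep:
        # Assumption: only the shared-mode combination is documented,
        # so this sketch refuses any other use of the flag.
        if shared_tmpdir is None:
            raise ValueError("--no-rm-tmpdir is only described with --use-shared-tmpdir")
        cmd.append("--no-rm-tmpdir")

    cmd.append(script)
    return cmd


if __name__ == "__main__":
    # Print the command instead of running it; job.sh is a placeholder script.
    print(shlex.join(sbatch_command("job.sh", shared_tmpdir="per-node", keep=True)))
```

Returning a list rather than a string lets the result be passed directly to ''subprocess.run'' without shell quoting concerns.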
 +
 +<WRAP center round important 60%>
 +The ''--no-rm-tmpdir'' flag should be used very cautiously, since files left behind on ''/lustre/scratch'' consume capacity on that file system. A viable usage scenario is debugging a job script that copies files to local scratch, runs a job, then copies results back to other storage. Once that behavior is debugged and the job goes into production, the user would stop using the ''--no-rm-tmpdir'' and ''--use-shared-tmpdir'' flags.
 +</WRAP>
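
Since directories kept with ''--no-rm-tmpdir'' have to be removed by hand once debugging is finished, a cleanup step might look like the sketch below. The helper ''remove_leftover_job_dir'' is hypothetical, and the default base path assumes the ''/lustre/scratch/slurm/job-«job-id»'' layout from the tables above. The demonstration runs against a throwaway temporary directory rather than the real file system.

```python
import shutil
import tempfile
from pathlib import Path


def remove_leftover_job_dir(job_id: int, base: Path = Path("/lustre/scratch/slurm")) -> bool:
    """Delete the directory tree a job left behind under `base`.

    Returns True if a job-<job_id> directory existed and was removed,
    False if there was nothing to remove. Hypothetical helper, not part
    of the plugin.
    """
    job_dir = base / f"job-{job_id}"
    if not job_dir.is_dir():
        return False
    shutil.rmtree(job_dir)
    return True


if __name__ == "__main__":
    # Demonstrate on a temporary stand-in for the shared scratch base.
    with tempfile.TemporaryDirectory() as scratch:
        base = Path(scratch)
        leftover = base / "job-12345" / "tmp"
        leftover.mkdir(parents=True)
        (leftover / "debug.log").write_text("example\n")

        print(remove_leftover_job_dir(12345, base))  # directory existed
        print(remove_leftover_job_dir(12345, base))  # already gone
```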