Revisions to Slurm Configuration v2.3.1 on Caviness

This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.

When the salloc command is invoked without a command (and its arguments) to run, the value of the SallocDefaultCommand key (in /etc/slurm/slurm.conf) provides the default command to execute on the allocated resources. For example:

[(workgroup:user)@login01 ~]$ salloc --partition=devel
salloc: Granted job allocation 13065047
salloc: Waiting for resource configuration
salloc: Nodes r00n56 are ready for job
[user@r00n56 ~]$

The default command as defined in the Slurm configuration mirrors the suggested default from the Slurm developers:

SallocDefaultCommand="srun -n1 -N1 --mpi=none --pty $SHELL"

Thus, the salloc illustrated above is equivalent to:

[(workgroup:user)@login01 ~]$ salloc --partition=devel srun -n1 -N1 --mpi=none --pty $SHELL
salloc: Granted job allocation 13065047
salloc: Waiting for resource configuration
salloc: Nodes r00n56 are ready for job
[user@r00n56 ~]$

There are several issues with this default command:

  1. The inclusion of -n1 -N1 limits the remote shell to accessing a single task of the allocation
  2. The remote shell ($SHELL, coming from the user's current environment) is not executed as a login shell
  3. The srun by default propagates the user's current environment variables to the remote node(s); we generally do not recommend this behavior on Caviness

On the first point, the Slurm allocation may have been for -N1 -n4 -c8 (one node, four tasks, eight CPUs per task, 32 CPUs in all), but the remote shell will only have access to one task (with eight CPUs). The user more likely expected the remote shell to have access to the full set of resources allocated on the primary node assigned to the job, akin to the batch step in submitted job scripts.

The second and third points may prevent some runtime environment setup from happening. This is problematic because Slurm reconstitutes exported environment variables in the runtime environment but does not restore unexported variables, aliases, and functions. Our best practice for job scripts is to send no environment variables from the submission environment to the runtime environment; ideally, the same should hold for interactive sessions.
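The partial restoration described above can be reproduced without Slurm. The following is a minimal sketch in which a child bash process stands in for the remote shell; the names MY_EXPORTED, MY_LOCAL, and my_func are hypothetical and exist only for this illustration.

```shell
# Illustration only: a child bash process stands in for the remote shell.
MY_EXPORTED="visible"
export MY_EXPORTED                      # exported variable
MY_LOCAL="hidden"                       # unexported variable
my_func() { echo "from function"; }     # unexported shell function

# The child inherits only what was exported; everything else is missing.
child_report=$(bash -c 'echo "exported=${MY_EXPORTED:-unset} local=${MY_LOCAL:-unset} func=$(type -t my_func || echo unset)"')
echo "$child_report"
```

Only the exported variable survives in the child. Any setup that relies on the unexported variable or the function is left half-configured, which is the inconsistency that sending no environment at all avoids.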

To make all resources on the primary node available to the remote shell, the node and task counts will be dropped from the SallocDefaultCommand. The --cpu-bind=none flag will also be added; otherwise, slurmstepd applies a task affinity mask that restricts the shell to just one of the allocated physical CPU cores.
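One way to see the effect of the affinity mask is to ask the shell how many CPUs it may actually run on. This is a rough sketch, assuming a Linux node with GNU coreutils (nproc) and /proc mounted; the function name report_cpu_access is hypothetical.

```shell
# Sketch: report how many CPUs the current shell may actually use.
report_cpu_access() {
    # nproc counts the CPUs in the calling process's affinity mask
    # (it also honors OMP_NUM_THREADS if that variable is set).
    usable=$(nproc)
    # The kernel's list of CPUs this process may run on, e.g. "0-7".
    allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status 2>/dev/null)
    # SLURM_CPUS_ON_NODE is set by Slurm inside a job; "n/a" elsewhere.
    echo "usable=${usable} allowed=${allowed:-unknown} slurm_cpus_on_node=${SLURM_CPUS_ON_NODE:-n/a}"
}
report_cpu_access
```

Run inside an interactive session, a usable count smaller than the number of CPUs allocated on the node would suggest that an affinity mask is still being applied to the shell.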

The majority of command shells recognize the -l flag as requesting login shell behavior. Appending a -l flag to the SallocDefaultCommand should be sufficient.
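For bash specifically, the effect of the -l flag can be checked with the read-only login_shell shell option. The sketch below assumes bash is on the PATH and passes --noprofile so that the check itself does not read any profile files; the variable names are hypothetical.

```shell
# Ask a bash started without -l, and one started with -l, whether it
# considers itself a login shell.
without_l=$(bash -c 'shopt -q login_shell && echo login || echo non-login')
# --noprofile keeps this check from sourcing /etc/profile and ~/.bash_profile;
# the shell is still a login shell because of -l.
with_l=$(bash --noprofile -l -c 'shopt -q login_shell && echo login || echo non-login')
echo "without -l: $without_l"
echo "with -l:    $with_l"
```

The same shopt test can be typed at the prompt of an interactive session to confirm which kind of shell was started. Other shells expose this differently, so the check applies to bash only.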

Finally, with regard to environment variable propagation, adding --export=NONE to the SallocDefaultCommand would implement the best practice we seek to promote. However, once set there, that behavior cannot be overridden with command line flags to salloc or via the environment (with SLURM_EXPORT_ENV). The only override is for a user to bypass the SallocDefaultCommand by providing an explicit command, e.g. an srun that lacks the --export flag. The desired best practice must be assumed to be the dominant use case (and will match the official documentation, for example), so adding --export=NONE to the SallocDefaultCommand is the correct choice.
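For users who do need environment propagation, the override amounts to spelling out the command by hand. As a sketch, the hypothetical helper below (keep_env_cmd is not part of Slurm) prints such a command line, using the remaining srun flags discussed in this document and simply leaving out --export.

```shell
# Hypothetical helper (not part of Slurm): print an explicit salloc command
# that bypasses SallocDefaultCommand and keeps the default environment
# propagation by omitting --export=NONE.
keep_env_cmd() {
    # "$*" carries the user's salloc options, e.g. --partition=devel.
    printf 'salloc %s srun --mpi=none --pty --cpu-bind=none %s -l\n' \
        "$*" "${SHELL:-/bin/bash}"
}
keep_env_cmd --partition=devel
```

The helper only prints the command so it can be inspected before use. Because an explicit command is supplied, salloc does not consult SallocDefaultCommand at all.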

This yields an altered SallocDefaultCommand of:

SallocDefaultCommand="srun --mpi=none --pty --export=NONE --cpu-bind=none $SHELL -l"

No downtime is necessary since this change affects only the behavior of the salloc command (not any of the Slurm daemons). The new configuration will be pushed to all nodes and take effect immediately.

Date        Time   Goal/Description
2022-01-12         Authoring of this document
2022-01-19  09:00  Implementation
Last modified: 2022-01-14 13:21 by anita