Integration of TCPLinda+Gaussian on Caviness
This document summarizes the steps necessary to tightly integrate Gaussian multi-node parallelism (TCPLinda) with our Slurm clusters.
Scheduling
The current infrastructure for Gaussian jobs expects submissions like:
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=<C>
#SBATCH --mem=<M> OR --mem-per-cpu=<M>
The allocated CPU core ids are used to construct a GAUSS_CDEF environment variable, and the on-node memory limit, minus fixed and per-core overhead, produces a value for GAUSS_MDEF. GAUSS_GDEF is also set if a GPU-enabled version of Gaussian is being used on a node with GPUs allocated to the job.
For multi-node Gaussian jobs, the submission profile would be expected to change to:
#SBATCH --nodes=<N>
#SBATCH --ntasks-per-node=<T>
#SBATCH --cpus-per-task=<C>
#SBATCH --mem=<M> OR --mem-per-cpu=<M>
Each task would equate to a TCPLinda worker, and each worker would have <C> cores for SMP parallelism. The GAUSS_WDEF environment variable would need to be set accordingly; e.g. for SLURM_JOB_NODELIST=r00n[00-02], SLURM_TASKS_PER_NODE=2, and SLURM_CPUS_PER_TASK=18 the environment would contain:
Variable | Value |
---|---|
GAUSS_WDEF | r00n00:2,r00n01:2,r00n02:2 |
GAUSS_CDEF | Not set |
GAUSS_PDEF | =$SLURM_CPUS_PER_TASK |
GAUSS_MDEF | Memory per task - overhead |
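The mapping in the table above can be sketched in a batch script as follows. This is a minimal sketch, not the existing infrastructure: the function name `build_gauss_wdef` is hypothetical, and the script assumes the job was submitted with `--ntasks-per-node` so that Slurm sets `SLURM_NTASKS_PER_NODE`. The GAUSS_MDEF calculation is omitted because the overhead values are site-specific.

```shell
#!/usr/bin/env bash
# Sketch: derive Gaussian's Linda-related defaults from a Slurm allocation.

# build_gauss_wdef TASKS_PER_NODE HOST...   (hypothetical helper)
# Prints "host:tasks,host:tasks,..." for the given hosts.
build_gauss_wdef() {
    local tasks_per_node=$1
    shift
    local wdef="" host
    for host in "$@"; do
        # Prepend a comma to every entry except the first.
        wdef+="${wdef:+,}${host}:${tasks_per_node}"
    done
    printf '%s\n' "$wdef"
}

# Only act when running inside a Slurm job with scontrol available.
if [[ -n "${SLURM_JOB_NODELIST:-}" ]] && command -v scontrol >/dev/null 2>&1; then
    # Expand e.g. r00n[00-02] into one hostname per line.
    mapfile -t hosts < <(scontrol show hostnames "$SLURM_JOB_NODELIST")
    export GAUSS_WDEF="$(build_gauss_wdef "${SLURM_NTASKS_PER_NODE:-1}" "${hosts[@]}")"
    # Core count per worker instead of explicit core ids.
    export GAUSS_PDEF="${SLURM_CPUS_PER_TASK:-1}"
    unset GAUSS_CDEF
fi
```

Calling `build_gauss_wdef 2 r00n00 r00n01 r00n02` prints `r00n00:2,r00n01:2,r00n02:2`, matching the GAUSS_WDEF row of the table.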
Since Slurm will not in general allocate the same <C> CPU core ids to each task, the now-deprecated GAUSS_PDEF CPU count must be used; otherwise, srun will communicate the same GAUSS_CDEF to each task and cores will fail to be bound. If future versions of Gaussian remove the GAUSS_PDEF functionality entirely, it will be necessary either to allocate entire nodes (at 1 task per node) or to wrap the Linda worker startup in a script that reconfigures GAUSS_CDEF on the fly. The latter is probably beneficial, as it would also allow GAUSS_MDEF and GAUSS_GDEF to be customized, which would be extremely important for heterogeneous job allocations.
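One possible shape for such a wrapper is sketched below. It rests on several assumptions: the function names are hypothetical, the node runs Linux, Slurm's task plugin actually restricts each task's CPU affinity, and Gaussian accepts the kernel's range notation (e.g. `0-8,18-26`) in GAUSS_CDEF. Per-worker GAUSS_MDEF and GAUSS_GDEF adjustments would follow the same pattern but are not shown.

```shell
#!/usr/bin/env bash
# Sketch: per-worker wrapper that derives GAUSS_CDEF from the cores the
# task is actually bound to. All function names are hypothetical.

# allowed_cpu_list [STATUS_FILE]
# Prints the Cpus_allowed_list value from a Linux /proc/<pid>/status file.
# The default, /proc/self/status, is read by the awk child process, which
# inherits the affinity mask of the calling shell.
allowed_cpu_list() {
    local status_file=${1:-/proc/self/status}
    awk '$1 == "Cpus_allowed_list:" { print $2; exit }' "$status_file"
}

# run_with_cdef COMMAND [ARG...]
# Sets GAUSS_CDEF for this worker, clears GAUSS_PDEF, then replaces the
# shell with the given command.
run_with_cdef() {
    local cores
    cores=$(allowed_cpu_list) || return 1
    [[ -n "$cores" ]] || return 1
    export GAUSS_CDEF="$cores"
    unset GAUSS_PDEF
    exec "$@"
}

# Intended use as the last line of a wrapper script:
#   run_with_cdef "$@"
```

Reading the affinity mask at worker startup, rather than computing it in the batch step, is what lets each task receive its own core list even when Slurm assigns different core ids on each node.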
TCPLinda Worker Startup
The linda_rsh
script is used by TCPLinda to execute a remote command on a node participating in the job. By default it uses rsh
or ssh
to connect to the remote host. But on a Slurm cluster the srun
command must be used for proper job containment and accounting.
The linda_rsh script would require modification to make use of srun only. Since TCPLinda executes one instance of linda_rsh per Linda task, the job's batch step would perform ((<N>*<T>)-1) invocations of linda_rsh, one fewer than the task count because the primary Gaussian process already runs in the batch step. Each such invocation will be made against a specific hostname (pulled from GAUSS_WDEF). Switching to srun would change e.g. the ssh formula
/usr/bin/ssh -x $host $user -n "$@"
to
srun --nodes=1 --ntasks=1 --cpus-per-task=${SLURM_CPUS_PER_TASK:-1} --nodelist=$host "$@"
Each TCPLinda worker (beyond the primary Gaussian process in the batch step) would be a separate job step in the final accounting of the job.
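The srun formula above could be wrapped in a small helper inside the modified linda_rsh, sketched here under assumptions: the function name `linda_srun_cmd` and the `SRUN_CMD` array are hypothetical, and how linda_rsh obtains `$host` and the remote command is left as in the original script. Building the command as an array keeps arguments containing spaces intact.

```shell
#!/usr/bin/env bash
# Sketch: build the srun command that replaces the rsh/ssh call in linda_rsh.

# linda_srun_cmd HOST COMMAND [ARG...]   (hypothetical helper)
# Fills the global array SRUN_CMD with one single-task job step pinned to
# HOST, using the per-task core count of the enclosing job.
linda_srun_cmd() {
    local host=$1
    shift
    SRUN_CMD=(srun --nodes=1 --ntasks=1
              --cpus-per-task="${SLURM_CPUS_PER_TASK:-1}"
              --nodelist="$host" "$@")
}

# Intended use inside linda_rsh, in place of the ssh line:
#   linda_srun_cmd "$host" "$@"
#   exec "${SRUN_CMD[@]}"
```

For the three-node example used earlier (<N>=3, <T>=2), TCPLinda would call this five times, producing five job steps alongside the batch step.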