This document summarizes alterations to the Slurm job scheduler configuration on the Caviness cluster.
One feature of Slurm is that a sysadmin can request that nodes be rebooted once all jobs running on them have completed. This sort of staged reboot of compute nodes is a useful way to handle rolling upgrades to the compute nodes' OS, for example, since the cluster schedules around nodes that are awaiting reboot. Reboots are performed by an external program/command that must be explicitly configured in Slurm; at this time no `RebootProgram` is configured.
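For reference, enabling the feature amounts to a single line in `slurm.conf` naming the reboot helper. The stanza below is illustrative only; the path matches the location planned later in this document, and nothing is configured yet:

```
# slurm.conf (illustrative; no RebootProgram is currently set on Caviness)
RebootProgram=/etc/slurm/libexec/node-reboot
```

Once configured, an administrator can stage reboots with e.g. `scontrol reboot ASAP <nodelist>`; Slurm drains each named node and invokes the program after its last running job completes.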
Graceful reboots handled by the Linux init/systemd processes have an unfortunate tendency to hang on our clusters: a cyclical dependency prevents the Lustre and OPA services from shutting down properly. Thus, we typically have the node's BMC power-cycle the node to effect a reboot. A simple script has been written that uses the local `ipmitool` if present, and otherwise attempts a forced reboot:
```bash
#!/bin/bash
#
# Force a reboot of the node.
#
IPMITOOL="$(which ipmitool)"
if [ $? -eq 0 ]; then
    "$IPMITOOL" power reset
else
    /usr/sbin/reboot --force
fi
```
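Once distributed to the directory described below, the script can also be exercised by hand on a node that has already been drained; the node name here is only a placeholder:

```bash
# WARNING: this immediately power-cycles the node, so drain it first.
ssh r00n00 /etc/slurm/libexec/node-reboot
```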
Rather than installing this script on the NFS software share (`/opt/shared/slurm`), it will be sync'ed by Warewulf to each node's root filesystem. Should NFS service be down for some reason, `slurmd` will still be able to effect a requested reboot.
A local Slurm-support executables directory, `/etc/slurm/libexec`, will be added alongside the configuration files in the compute node VNFS images. The directory will have permissions making it accessible only by the Slurm user and group. Future support executables (e.g. prolog/epilog scripts) will also be sync'ed to this directory.
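Creating the directory in the VNFS chroot (and on running nodes) could look like the following; the 0750 mode is a suggested value consistent with the restriction described above:

```bash
# Restrict the directory to the slurm user and group (mode is a suggestion)
install -d -o slurm -g slurm -m 0750 /etc/slurm/libexec
```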
Addition of the `/etc/slurm/libexec` directory to the VNFS images and to running root filesystems requires no reboots. The `node-reboot` script is imported into Warewulf and added to all compute node provisioning profiles for automatic distribution.
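One plausible sequence using Warewulf 3's `wwsh` tooling is sketched below; the source path, node pattern, and file mode are placeholders rather than the exact values used on Caviness:

```bash
# Import the reboot helper into the Warewulf datastore
wwsh file import /root/node-reboot --name=node-reboot

# Choose where it lands on each node, and with what ownership/mode
wwsh file set node-reboot --path=/etc/slurm/libexec/node-reboot \
    --mode=0750 --uid=$(id -u slurm) --gid=$(id -g slurm)

# Attach it to the compute nodes' provisioning configuration
wwsh provision set 'r*' --fileadd=node-reboot
```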
Slurm changes are effected by altering the configuration files, pushing the changed files to all nodes, and signaling the daemons so that they all refresh their configuration.
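Assuming the updated files are already in place on every node, the final signaling step is a single command on the management node:

```bash
# Ask slurmctld and every slurmd to re-read their configuration files
scontrol reconfigure
```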
No downtime is expected to be required. As this change is entirely operational in nature and has zero effect on users, it will be made without the usual review by stakeholders and users.
| Date | Time | Goal/Description |
|---|---|---|
| 2019-05-22 | | Authoring of this document |
| | | Changes made |