Automated /dev/shm cleanup

Historically, shared memory was a System V IPC facility in which one process marked memory segments as shared and other processes on the same machine mapped those segments into their own address space. Each segment was identified by a key, and that key had to be communicated to every dependent process that wanted to map the segment. This was often accomplished as follows:

  1. master process sets up shared segments and exports an environment variable identifying the key(s)
  2. master process forks slave processes
    1. each slave process consults the appropriate environment variable for shared memory key(s)
    2. each slave maps the necessary shared segments into its memory space

When a program like this crashes, it often leaves its shared segments orphaned:

$ ipcs -m
 
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0x00000000 45350912   frey       700        8388608    0                       
0x00000000 45383681   frey       700        8388608    0                       
0x00000000 45416450   frey       700        8388608    1

Behavior of such programs varies. The master process can create and attach the segment(s) and immediately mark them for destruction, in which case each segment is reclaimed as soon as every process that attached it has exited. That approach is not appropriate in all cases, though, so a segment with an attachment count of zero is not necessarily orphaned. This rules out automated cleanup based solely on a zero attachment count. Instead, a user with access to the segment, or someone with root privileges, must remove the memory segments manually:

$ ipcrm --shmem-id 45350912 --shmem-id 45383681

With the advent of the memory-backed Linux tmpfs, it became possible to implement a mechanism similar to IPC shared memory using files:

  • a POSIX shared memory segment is backed by a file in /dev/shm
  • the backing file has standard Unix filesystem permissions applied to it
  • the backing file can be mmap'ed by any process with access to the file
  • the segment can be examined or removed using standard filesystem tools
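
The file-backed model can be illustrated with a short Python sketch. It is not taken from any program discussed here: the function names are hypothetical, and it assumes the usual Linux behavior in which a POSIX segment is simply a file that processes map with mmap().

```python
import mmap
import os

def create_segment(path, size):
    """Create a backing file of the given size and map it shared (hypothetical helper)."""
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, size)        # give the segment its size
        return mmap.mmap(fd, size)    # MAP_SHARED is the default on Unix
    finally:
        os.close(fd)                  # the mapping stays valid after close

def attach_segment(path):
    """Map an existing backing file; ordinary file permissions apply."""
    fd = os.open(path, os.O_RDWR)
    try:
        return mmap.mmap(fd, os.fstat(fd).st_size)
    finally:
        os.close(fd)
```

A writer could call create_segment("/dev/shm/demo_segment", 4096) (the name is made up) and a reader attach_segment() on the same path. Until something unlinks the file, it remains visible to ls and rm like any other file.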

Unlike IPC segments, POSIX segments cannot be marked for destruction when no longer attached to a process. When any program that creates POSIX segments crashes, it leaves behind files in /dev/shm.

Our clusters run a large number of Open MPI jobs. When those jobs crash, the vader BTL leaves behind orphaned POSIX shared memory segments. On Omni-Path fabrics, the PSM2 library also creates POSIX shared memory segments. In both cases, the segments have no useful life beyond the run of the Open MPI program. This yields simple criteria for deciding when a segment must be kept. A PSM2 or vader segment file is retained if either of the following holds, and can be removed otherwise:

  • its change/modify/access timestamp is very recent (e.g. within the last hour),
  • OR the file is actively in use by at least one process on the system
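
The two criteria combine into a single predicate. The sketch below is illustrative only and is not the code from shm-cleanup.py; the function name and its arguments are assumptions.

```python
import time

def may_remove(newest_timestamp, in_use, max_age_seconds, now=None):
    """Return True when a segment is neither recent nor in use.

    newest_timestamp: the most recent of the file's timestamps (epoch seconds)
    in_use:           True if some process currently has the file open
    max_age_seconds:  e.g. 3600 for PSM2/vader segment files
    """
    now = time.time() if now is None else now
    is_recent = (now - newest_timestamp) < max_age_seconds
    # Either condition alone is enough to keep the segment.
    return not (is_recent or in_use)
```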

For arbitrary POSIX segment files, the same criteria with a longer time window (perhaps 1 day) identify the segments that can be purged.

The procedure begins by enumerating the segments that exist, and this is the stage at which the time-based criteria that disqualify segments from removal are applied. Walking the /dev/shm directory tree and calling stat() on each file provides the change/modify/access timestamps (ctime, mtime, atime) for the segments. Rather than marking each individual file for removal, the paths are reduced to their first-level entries: e.g. /dev/shm/a/b/c is treated as /dev/shm/a. This produces the set A of defined memory segments.
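
A minimal sketch of this scan follows. It is not the implementation in shm-cleanup.py, and the helper names are hypothetical. It also makes one assumption the text does not spell out: a first-level entry qualifies only if everything beneath it is older than the threshold.

```python
import os
import time

def first_level(path, root="/dev/shm"):
    """Reduce a path under root to its first-level entry: root/a/b/c -> root/a."""
    rel = os.path.relpath(path, root)
    return os.path.join(root, rel.split(os.sep, 1)[0])

def newest_timestamp(path):
    """Most recent of the access, modify, and change times of one file."""
    st = os.lstat(path)
    return max(st.st_atime, st.st_mtime, st.st_ctime)

def candidate_segments(root="/dev/shm", max_age_seconds=86400, now=None):
    """Build set A: first-level entries whose newest timestamp is old enough."""
    now = time.time() if now is None else now
    newest = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                stamp = newest_timestamp(path)
            except FileNotFoundError:
                continue  # the file vanished while we were scanning
            top = first_level(path, root)
            newest[top] = max(newest.get(top, 0.0), stamp)
    # Assumption: one recent file disqualifies its whole first-level entry.
    return {top for top, stamp in newest.items()
            if now - stamp >= max_age_seconds}
```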

The lsof command has an option, +D <path>, that instructs it to report open files anywhere under the given directory. Using +D /dev/shm keeps the command's work to a minimum and yields information only on the files relevant to the cleanup. Again, each path under /dev/shm is reduced to its first-level entry. This produces the set B of in-use memory segments.
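
One way to collect these paths from Python is sketched below. It assumes lsof's field output mode (-F n), in which each file name appears on a line prefixed with "n". The function names are hypothetical and the script in our repository may parse lsof differently.

```python
import os
import subprocess

def first_level(path, root="/dev/shm"):
    """Reduce a path under root to its first-level entry: root/a/b/c -> root/a."""
    rel = os.path.relpath(path, root)
    return os.path.join(root, rel.split(os.sep, 1)[0])

def parse_lsof_names(output, root="/dev/shm"):
    """Build set B from `lsof -F n` output (name lines start with 'n')."""
    in_use = set()
    for line in output.splitlines():
        if not line.startswith("n"):
            continue  # skip pid ('p') and descriptor ('f') lines
        path = line[1:]
        if path.startswith(root + os.sep):
            in_use.add(first_level(path, root))
    return in_use

def in_use_segments(root="/dev/shm"):
    """Run lsof and return the first-level entries that are open."""
    # lsof can exit non-zero when it finds no open files, so the
    # return code is deliberately not treated as an error here.
    result = subprocess.run(["lsof", "-F", "n", "+D", root],
                            capture_output=True, text=True)
    return parse_lsof_names(result.stdout, root)
```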

The set difference, A \ B, is the set of all elements of A that are not in B. The resulting set C contains the first-level paths under /dev/shm that we wish to remove.
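
In Python the set difference is the built-in `-` operator on sets. The paths below are made-up examples.

```python
# Set A: segments that exist and are old enough (hypothetical paths).
defined = {"/dev/shm/a", "/dev/shm/b", "/dev/shm/c"}
# Set B: segments some process still has open.
in_use = {"/dev/shm/b", "/dev/shm/x"}
# Set C: everything in A that is not in B.
removable = defined - in_use
```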

Removal is performed as root by running rm -rf <path> for each path in C.
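
An equivalent of rm -rf can be written with the Python standard library. The sketch below is illustrative: the function name and its dry_run flag are assumptions modeled on the program's --dry-run option, not its actual code.

```python
import os
import shutil

def remove_segments(paths, dry_run=False):
    """Remove each first-level path; return the list of paths removed."""
    removed = []
    for path in sorted(paths):
        if dry_run:
            print(f"would remove {path}")
            continue
        try:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)   # directory tree, like rm -rf
            else:
                os.unlink(path)       # plain file or symlink
        except FileNotFoundError:
            continue  # already gone, e.g. the owner cleaned up
        removed.append(path)
    return removed
```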

The lsof utility must be present on the system for the procedure outlined above to work. We implemented the procedure in Python and have a cron job that runs the program at a fixed interval. The Python script is part of our Slurm additions project on GitHub (see the helpers directory).

The program has various command line options available:

$ shm-cleanup.py --help
usage: shm-cleanup.py [-h] [-v] [-q] [-n] [--show-log-timestamps]
                      [--age <age-threshold>] [--no-special-treatment]
                      [--log-file <filename>] [--daemon]
                      [--daemon-period <period>] [--pid-file <filename>]
 
Cleanup /dev/shm
 
optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase level of verbosity
  -q, --quiet           decrease level of verbosity
  -n, --dry-run         do not remove any files, just display what would be
                        done; this option sets the base verbosity level to
                        INFO (as in -vv)
  --show-log-timestamps, -t
                        display timestamps on all messages logged by this
                        program
  --age <age-threshold>, -a <age-threshold>
                        only items older than this will be removed; integer or
                        floating-point values are acceptable with optional
                        unit of s/m/h/d (default: d)
  --no-special-treatment
                        do not treat PSM2 and vader segment files any
                        differently than other files
  --log-file <filename>, -l <filename>
                        send all logging to this file instead of to stderr;
                        timestamps are always enabled when logging to a file
  --daemon              run as a daemon, periodically waking to re-check
  --daemon-period <period>
                        wake to re-check on the given period; integer or
                        floating-point values are acceptable with optional
                        unit of s/m/h/d (default: s)
  --pid-file <filename>
                        in daemon mode, write our pid to this file (default:
                        /var/run/shm-cleanup.pid)

On systems that lack cron (or a similar timed-execution mechanism), the --daemon mode may be helpful.
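
The --age and --daemon-period options in the help text accept a number with an optional s/m/h/d unit. A parser for that format might look like the sketch below. It reads "(default: d)" and "(default: s)" as the unit assumed when none is given. The function name is hypothetical and the script's own parsing may differ.

```python
import re

_UNIT_SECONDS = {"s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
# An integer or floating-point number, then an optional unit letter.
_DURATION = re.compile(r"^\s*([0-9]+(?:\.[0-9]*)?|\.[0-9]+)\s*([smhd]?)\s*$")

def parse_duration(text, default_unit="d"):
    """Convert strings such as '1', '1.5h' or '90m' to seconds."""
    match = _DURATION.match(text)
    if match is None:
        raise ValueError(f"invalid duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit or default_unit]
```

With these assumptions, parse_duration("1") gives one day in seconds, while parse_duration("30", default_unit="s") gives thirty seconds.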

  • technical/whitepaper/automated_devshm_cleanup.txt
  • Last modified: 2018-12-13 13:02
  • by frey