====== Automated /dev/shm cleanup ======
Historically, //shared memory// was an IPC facility whereby one process could mark memory segments as being shared, and other processes on the same machine could map those segments into their memory space. Segments had a //key// that identified them, and the key had to be communicated to the dependent processes that wanted to map the segment. This was often accomplished as follows:
- master process sets up shared segments and exports an environment variable identifying the key(s)
- master process forks slave processes
- each slave process consults the appropriate environment variable for shared memory key(s)
- each slave maps the necessary shared segments into its memory space
When a program like this crashes, it often leaves its shared segments orphaned:
$ ipcs -m
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 45350912   frey       700        8388608    0
0x00000000 45383681   frey       700        8388608    0
0x00000000 45416450   frey       700        8388608    1
Behavior of such programs varies. The master process can create and attach the segment(s) and immediately mark them for destruction, in which case a segment is reclaimed as soon as every process that attached it has exited. This is not appropriate in all cases, though, so a segment with an attachment count of zero is not necessarily orphaned, which rules out automated cleanup based on a zero attachment count alone. Instead, a user with access to the segment or someone with root privileges must manually remove the memory segments:
$ ipcrm --shmem-id 45350912 --shmem-id 45383681
With the advent of the memory-backed Linux ''tmpfs'', it became possible to implement a mechanism similar to IPC shared memory using files:
* a POSIX shared memory segment is backed by a file in ''/dev/shm''
* the backing file has standard Unix filesystem permissions applied to it
* the backing file can be mmap'ed by any process with access to the file
* the segment can be examined or removed using standard filesystem tools
Unlike IPC segments, POSIX segments cannot be marked for destruction when no longer attached to a process. When any program that creates POSIX segments crashes, it leaves behind files in ''/dev/shm''.
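The file-backed behavior described above can be observed from Python. The following is a minimal sketch, assuming Python 3.8 or newer (for ''multiprocessing.shared_memory'') and a Linux system where POSIX segments appear under ''/dev/shm''; the function name ''demo_posix_segment'' is ours and purely illustrative.

```python
import os
from multiprocessing import shared_memory


def demo_posix_segment(payload=b"hello"):
    """Create a POSIX segment, look at its backing file, then unlink it.

    Returns (path, existed, contents, exists_after). The path is where we
    *expect* the backing file on Linux; other platforms may differ.
    """
    seg = shared_memory.SharedMemory(create=True, size=len(payload))
    path = os.path.join("/dev/shm", seg.name.lstrip("/"))
    contents = None
    try:
        seg.buf[:len(payload)] = payload
        existed = os.path.exists(path)
        if existed:
            # The segment can be read with ordinary file I/O.
            with open(path, "rb") as handle:
                contents = handle.read(len(payload))
    finally:
        seg.close()
        # Without this unlink (e.g. after a crash) the file would remain.
        seg.unlink()
    return path, existed, contents, os.path.exists(path)


if __name__ == "__main__":
    print(demo_posix_segment())
```

If the process were killed before reaching ''unlink()'', the backing file would stay behind, which is exactly the kind of orphan the cleanup procedure targets.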
===== Cleaning-up =====
Many Open MPI jobs run on our clusters. When they crash, the vader BTL leaves behind orphaned POSIX shared memory segments. On Omni-Path systems, the PSM2 library also creates POSIX shared memory segments. In both cases, the segments have no useful life beyond the run of the Open MPI program. This yields simple criteria for deciding which segments must be kept. A PSM2 or vader segment file is kept if:
* its change, modify, or access timestamp is very recent (e.g. within the last hour),
* OR the file is actively in use by at least one process on the system
Any PSM2 or vader segment file that meets neither condition can be removed.
For arbitrary POSIX segment files, the same criteria with a longer timespan (perhaps 1 day) would target segments that can be purged.
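Expressed as a removal test, a segment is a candidate only when it is both older than its threshold and not in use. The sketch below encodes that rule. The file-name prefixes (''vader_segment.'' and ''psm2_shm.'') are our assumption about how these segments are named and should be checked against the files actually present on your nodes; the function names are illustrative.

```python
import re

# Assumed name prefixes for Open MPI vader and PSM2 segment files.
SPECIAL_NAME = re.compile(r"^(vader_segment\.|psm2_shm\.)")

HOUR = 3600.0
DAY = 24 * HOUR


def max_age_for(name):
    """Short threshold for vader/PSM2 files, longer one for everything else."""
    return HOUR if SPECIAL_NAME.match(name) else DAY


def is_removable(name, age_seconds, in_use):
    """True only if the entry is unused AND older than its threshold."""
    return (not in_use) and age_seconds > max_age_for(name)
```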
==== Finding all shared memory segments ====
This is the stage at which the time-based criteria that disqualify segments from removal are applied. Walking the ''/dev/shm'' directory tree and calling ''stat()'' on each file provides the change/modify/access timestamps for the segments. Rather than marking each individual file for removal, the paths are reduced to their first-level entries: e.g. ''/dev/shm/a/b/c'' maps to ''/dev/shm/a''. The result is the set **A** of existing memory segments that are old enough to be removed.
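A minimal sketch of this scan follows. It is not the code from our script: the function names are illustrative, and it assumes that a first-level entry is judged by the newest timestamp found on the entry itself or on anything beneath it, so a single recently touched file protects the whole entry.

```python
import os
import time


def newest_timestamp(path):
    """Newest atime/mtime/ctime of path and, for a directory, its contents."""
    st = os.lstat(path)
    newest = max(st.st_atime, st.st_mtime, st.st_ctime)
    if os.path.isdir(path) and not os.path.islink(path):
        for dirpath, dirnames, filenames in os.walk(path):
            for name in dirnames + filenames:
                try:
                    st = os.lstat(os.path.join(dirpath, name))
                except OSError:
                    continue  # entry vanished while we were walking
                newest = max(newest, st.st_atime, st.st_mtime, st.st_ctime)
    return newest


def aged_entries(root="/dev/shm", max_age=86400.0, now=None):
    """Return the set A: first-level paths under root older than max_age."""
    now = time.time() if now is None else now
    result = set()
    for name in os.listdir(root):
        path = os.path.join(root, name)
        try:
            if now - newest_timestamp(path) > max_age:
                result.add(path)
        except OSError:
            continue  # entry vanished between listdir() and lstat()
    return result
```

The ''now'' parameter exists only to make the function easy to exercise against a scratch directory; in normal use it defaults to the current time.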
==== Finding active shared memory segments ====
The ''lsof'' command has an option, ''+D <directory>'', that instructs it to check all paths under the given directory. Using ''+D /dev/shm'' keeps the command's work to a minimum and reports only the files relevant to the cleanup. Again, each path under ''/dev/shm'' is reduced to its first-level entry, producing the set **B** of in-use memory segments.
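One way to build this set is to ask ''lsof'' for machine-readable output with ''-Fn'', in which every file name is printed on its own line prefixed with ''n'', and then reduce each name to its first-level entry. This is a sketch under those assumptions; the helper names are ours, and the parsing is deliberately separated from the ''lsof'' call so it can be checked against canned output.

```python
import subprocess


def first_level(path, root="/dev/shm"):
    """Map /dev/shm/a/b/c to /dev/shm/a; return None for paths outside root."""
    root = root.rstrip("/")
    if not path.startswith(root + "/"):
        return None
    head = path[len(root) + 1:].split("/", 1)[0]
    return root + "/" + head if head else None


def in_use_entries(lsof_output, root="/dev/shm"):
    """Return the set B from the text produced by `lsof -Fn +D root`."""
    result = set()
    for line in lsof_output.splitlines():
        if line.startswith("n"):  # name field; p/f lines are ignored
            top = first_level(line[1:], root)
            if top:
                result.add(top)
    return result


def lsof_in_use(root="/dev/shm"):
    """Run lsof and parse its output; requires lsof to be installed."""
    # lsof exits non-zero when nothing is open, so the exit code is not checked.
    proc = subprocess.run(["lsof", "-Fn", "+D", root],
                          stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
                          universal_newlines=True)
    return in_use_entries(proc.stdout, root)
```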
==== Segments for removal ====
The set difference **A** \ **B** is the set of all elements of **A** that are not in **B**. The resulting set **C** contains the first-level paths under ''/dev/shm'' that we wish to remove.
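With both sets held as Python ''set'' objects, the difference is a single operator. The sketch below uses a hypothetical helper name and sorts the result only to make any log output deterministic.

```python
def segments_to_remove(candidates, in_use):
    """Return C: every path in `candidates` that is not in `in_use`."""
    return sorted(set(candidates) - set(in_use))
```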
==== Removing segments ====
Running as root, removal is accomplished using ''rm -rf <path>'' for each path in **C**.
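The same effect can be had from within Python without spawning ''rm''. This is a sketch rather than our script's actual code: ''remove_entries'' is an illustrative name, and the ''dry_run'' flag mirrors the idea of reporting what would be done without touching anything.

```python
import logging
import os
import shutil


def remove_entries(paths, dry_run=False):
    """Remove each first-level path; return the list actually removed."""
    removed = []
    for path in paths:
        if dry_run:
            logging.info("would remove %s", path)
            continue
        try:
            if os.path.isdir(path) and not os.path.islink(path):
                shutil.rmtree(path)   # directory tree, like rm -rf
            else:
                os.unlink(path)       # plain file or symlink
        except FileNotFoundError:
            continue  # already gone, e.g. the owning job cleaned up
        except OSError as err:
            logging.warning("could not remove %s: %s", path, err)
            continue
        removed.append(path)
    return removed
```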
===== Implementation =====
The ''lsof'' utility must be present on a system for the procedure outlined above to work. We implemented the procedure in Python and have a cron job executing the program at a fixed interval. The Python script is part of our [[https://github.com/jtfrey/ud_slurm_addons|Slurm additions project on GitHub]] (see the ''helpers'' directory).
The program has various command line options available:
$ shm-cleanup.py --help
usage: shm-cleanup.py [-h] [-v] [-q] [-n] [--show-log-timestamps]
                      [--age ] [--no-special-treatment]
                      [--log-file ] [--daemon]
                      [--daemon-period ] [--pid-file ]

Cleanup /dev/shm

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         increase level of verbosity
  -q, --quiet           decrease level of verbosity
  -n, --dry-run         do not remove any files, just display what would be
                        done; this option sets the base verbosity level to
                        INFO (as in -vv)
  --show-log-timestamps, -t
                        display timestamps on all messages logged by this
                        program
  --age , -a
                        only items older than this will be removed; integer or
                        floating-point values are acceptable with optional
                        unit of s/m/h/d (default: d)
  --no-special-treatment
                        do not treat PSM2 and vader segment files any
                        differently than other files
  --log-file , -l
                        send all logging to this file instead of to stderr;
                        timestamps are always enabled when logging to a file
  --daemon              run as a daemon, periodically waking to re-check
  --daemon-period
                        wake to re-check on the given period; integer or
                        floating-point values are acceptable with optional
                        unit of s/m/h/d (default: s)
  --pid-file
                        in daemon mode, write our pid to this file (default:
                        /var/run/shm-cleanup.pid)
On systems that lack cron (or a similar timed-execution mechanism), the ''--daemon'' mode may be helpful.
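The ''--age'' and ''--daemon-period'' options both accept a number with an optional ''s/m/h/d'' unit. The following sketch shows one way such values could be converted to seconds; it is not taken from the script, and the name ''parse_duration'' and its ''default_unit'' parameter are illustrative.

```python
import re

_UNIT_SECONDS = {"s": 1.0, "m": 60.0, "h": 3600.0, "d": 86400.0}
_DURATION = re.compile(r"^\s*(\d+(?:\.\d*)?|\.\d+)\s*([smhd]?)\s*$",
                       re.IGNORECASE)


def parse_duration(text, default_unit="d"):
    """Convert e.g. '1.5h', '30s' or '2' to seconds.

    A bare number is interpreted in `default_unit` (days for an age,
    seconds for a daemon period in the help text above).
    """
    match = _DURATION.match(text)
    if match is None:
        raise ValueError("invalid duration: %r" % (text,))
    value = float(match.group(1))
    unit = (match.group(2) or default_unit).lower()
    return value * _UNIT_SECONDS[unit]
```

For example, ''parse_duration("1")'' treats the bare number as one day, while ''parse_duration("30", default_unit="s")'' treats it as thirty seconds.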