Better Integration with Grid Engine

Our first attempts at augmenting Grid Engine with our own CGroup support exploited the same process-creation notification mechanism that Linux's cgred (CGroups Rules Engine Daemon) uses to quarantine new processes. When a process forks, executes a program, or exits, the kernel delivers information about that process over a netlink interface to any userland programs listening for such notifications. A daemon can examine the information and react: for example, cgred matches a ruleset against the process information and possibly assigns the process to one or more CGroups. In the case of Grid Engine jobs, the CGroup(s) must be created on-the-fly when a job begins and destroyed when the job ends, and the job's component processes must be quarantined to the CGroup(s). Since cgred currently has no such dynamism in its behavior, we had to write our own daemon.

The daemon we wrote reacted whenever an sge_shepherd process forked a new process: it created the CGroup(s) if they were not already present and assigned the new process to them. It also noted the new process's pid so that later, when the kernel notified the daemon of that process's exit, the CGroup(s) could be destroyed.

This concept worked fine until we entered large-scale testing and found that processes descended from the sge_shepherd did not always get assigned to the job's CGroup(s). This happened most often with jobs that quickly spawn many child processes (e.g. an MPI job). The problem was that our daemon was written to react only to processes that were direct children of an sge_shepherd process, and the kernel's netlink notification does not wait for acknowledgement from userland listeners, so the process in question is never blocked from executing. While our daemon was busy creating the CGroup(s) and assigning that pid to them, the job process itself was already forking and executing additional programs.

Kernel netlink notifications of process fork/exec do not block the process from executing, so they cannot by themselves guarantee CGroup quarantine. A process that leads a group of child processes can fork new processes before the quarantine occurs, leaving some early child processes outside the desired CGroup.

Assigning a pid to a CGroup does not assign that process's existing child processes to the CGroup; only children forked after the assignment inherit it. Existing children must be assigned recursively, but since every process in question can fork new processes while a daemon is recursively adding pids to the CGroup, a daemon that relies on netlink process state notifications can never be 100% assured that all processes have been quarantined.
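
The race can be made concrete with a small sketch. It assumes the process table is available as a pid-to-parent-pid snapshot (for example, gathered from /proc); the function name and the pids are purely illustrative and are not taken from our daemon.

```python
from collections import defaultdict, deque

def descendants(ppid_of, root):
    """Return every pid that descends from root in a pid -> ppid snapshot."""
    children = defaultdict(list)
    for pid, ppid in ppid_of.items():
        children[ppid].append(pid)
    found, queue = [], deque([root])
    while queue:
        for child in sorted(children[queue.popleft()]):
            found.append(child)
            queue.append(child)
    return found

# Snapshot taken when the daemon starts its recursive walk from pid 100.
before = {100: 1, 200: 100, 201: 200}
# Before the daemon has moved pid 201 into the CGroup, 201 forks pid 202.
after = dict(before)
after[202] = 201

quarantined = set(descendants(before, 100))
missed = set(descendants(after, 100)) - quarantined  # pids the walk never saw
```

Repeating the walk narrows the window but does not close it, since a fork can always land between the last snapshot and the last assignment.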

One possible solution would be to watch all process forks and executions and, for each such process, walk its parent chain to determine whether it descends from an sge_shepherd process. Doing so would require either extensive reads of the /proc filesystem for each netlink notification or a more extensive state table maintained by the daemon for all processes it intercepts. With both options being complicated and slow, the simple solution of adding the first children of the sge_shepherd process and having their children automatically be quarantined remained preferable. The only way to ensure that this method works properly is to block execution while the quarantine is being done.
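
To show what the rejected parent-chain walk would involve, here is a minimal sketch that reads the parent pid from /proc/<pid>/stat. The function names are our own for illustration. The lookup is injectable so the walk can be exercised without a live process tree, and a real daemon would also have to cope with processes exiting mid-walk.

```python
def parse_stat(line):
    """Extract (comm, ppid) from the text of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' rather than on whitespace.
    left, right = line.rsplit(")", 1)
    comm = left.split("(", 1)[1]
    ppid = int(right.split()[1])  # fields after comm: state, ppid, ...
    return comm, ppid

def read_stat(pid):
    """Read and parse /proc/<pid>/stat for a live process."""
    with open(f"/proc/{pid}/stat") as fh:
        return parse_stat(fh.read())

def descends_from(pid, ancestor_comm, lookup=read_stat):
    """Walk pid's parent chain; True if any ancestor is named ancestor_comm."""
    _, ppid = lookup(pid)
    while ppid > 0:
        comm, next_ppid = lookup(ppid)
        if comm == ancestor_comm:
            return True
        ppid = next_ppid
    return False
```

Each notification costs one file read per ancestor, which is the overhead that made this approach unattractive.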

Blocking Execution

Properly handling the quarantine of a Grid Engine job in CGroup(s) created on-the-fly required the ability to prevent the job's processes from executing until our daemon had quarantined them. We needed to somehow patch the kernel, or all programs, to do our CGroup quarantine work prior to issuing an exec() function call. When exec() is called, the following should precede the actual execution of the program:

Determine if the process is a child of an sge_shepherd
If so:
- Determine the job's resource profile (core counts, memory limits)
- Create necessary CGroup(s)
- Add the process's pid to the CGroup(s)
Proceed with the exec() function
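
The ordering of these steps can be sketched as follows. This is a Python stand-in for logic that would live in C inside the preloaded library. Every callable is injected, and the CGroup names and profile values are hypothetical. The point is only that the real exec is not reached until the quarantine steps have returned.

```python
def guarded_exec(pid, parent_comm, get_profile, create_cgroups, add_pid, real_exec):
    """Run the quarantine steps if needed, then call through to real_exec."""
    if parent_comm == "sge_shepherd":
        profile = get_profile(pid)         # core counts, memory limits
        cgroups = create_cgroups(profile)  # expected to tolerate existing groups
        for cgroup in cgroups:
            add_pid(cgroup, pid)
    return real_exec()

# Stand-in callables that only record the order in which they run.
log = []

def get_profile(pid):
    log.append("profile")
    return {"cores": 4, "mem_bytes": 8 << 30}  # hypothetical values

def create_cgroups(profile):
    log.append("create")
    return ["cpuset/job", "memory/job"]        # hypothetical CGroup names

def add_pid(cgroup, pid):
    log.append(f"add {pid} to {cgroup}")

def real_exec():
    log.append("exec")
    return 0

guarded_exec(4242, "sge_shepherd", get_profile, create_cgroups, add_pid, real_exec)
```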

Ideally, this patched behavior of exec() should apply only to the sge_execd daemon on the compute nodes; be inherited by sge_shepherd processes spawned by sge_execd; and be removed from processes the sge_shepherd spawns. The LD_PRELOAD functionality in Linux provides the appropriate mechanism for overriding a set of functions with your own implementation while retaining the ability to "call-through" to the original implementation.
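
Controlling which processes inherit the patched behavior comes down to editing the LD_PRELOAD variable in the environment handed to the original exec(). A minimal sketch of that editing follows. The helper names are illustrative, and the environment is modeled as a dictionary rather than the C envp array the library would actually receive.

```python
import re

def _entries(value):
    # The dynamic linker accepts both spaces and colons as separators.
    return [e for e in re.split(r"[ :]+", value) if e]

def with_preload(env, lib):
    """Return a copy of env whose LD_PRELOAD includes lib."""
    new_env = dict(env)
    entries = _entries(env.get("LD_PRELOAD", ""))
    if lib not in entries:
        entries.insert(0, lib)
    new_env["LD_PRELOAD"] = ":".join(entries)
    return new_env

def without_preload(env, lib):
    """Return a copy of env whose LD_PRELOAD no longer includes lib."""
    new_env = dict(env)
    entries = [e for e in _entries(env.get("LD_PRELOAD", "")) if e != lib]
    if entries:
        new_env["LD_PRELOAD"] = ":".join(entries)
    else:
        new_env.pop("LD_PRELOAD", None)  # drop the variable when it would be empty
    return new_env
```

Under this scheme the library would keep itself in the environment when sge_execd spawns an sge_shepherd and strip itself when the sge_shepherd spawns the job, while leaving any unrelated preloaded libraries in place.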

With our own code handling the creation of the CGroup(s) we could address the inconsistencies noted in the vendor's implementation (wrong memory limits, lack of support on slave nodes). An LD_PRELOAD library implementing our own suite of exec() functions that perform quarantine and call through to the OS-provided variants would produce the blocking behavior necessary.

Implementation Plan

The implementation of our support for Linux CGroups must necessarily be external to Grid Engine itself (since we use a commercial variant). It must consist of several components:

- A shared library to be used via LD_PRELOAD and containing implementations of each exec() function from the standard C library (execve(), execle(), execvp(), et al.). Each function should determine whether or not the process must be quarantined in CGroup(s), perform the quarantine if so, and be able to call through to the original implementation of itself. The environment being passed to the original exec() function may need to be altered to add/remove the LD_PRELOAD variable.
- In general, root privileges are required to create CGroup containers, and the uid of the processes needing quarantine will not be root. The actual quarantine will be handled by a daemon running as root, and the LD_PRELOAD shared library will request quarantine by means of a socket interface.
- The daemon should maintain enough state to avoid repeated requests for job resource information from the Grid Engine qmaster, as well as repeated walks of existing CGroup containers, e.g. to select unique, unused cores. This should keep the overhead of the process low. However, the ability to fall back to the more time-consuming methods is desirable.
- Ideally, the daemon should also register for out-of-memory (OOM) notifications from the kernel for processes that it quarantines. Producing log information on such events would be beneficial for end users (versus Grid Engine's providing no information).
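
The request/acknowledge exchange between the preloaded library and the root daemon is what produces the blocking behavior. The sketch below illustrates it over a connected pair of sockets. The one-line text protocol (QUARANTINE, OK, FAIL) and the function names are assumptions made for this example, not the actual wire format, and the quarantine function is a stand-in that only records its arguments.

```python
import socket
import threading

def request_quarantine(sock, pid, job_id):
    """Client side: ask for quarantine and block until the daemon answers."""
    sock.sendall(f"QUARANTINE {job_id} {pid}\n".encode())
    with sock.makefile("r") as stream:
        return stream.readline().strip() == "OK"

def serve_one(sock, quarantine):
    """Daemon side: handle one request, replying only after the work is done."""
    with sock.makefile("r") as stream:
        verb, job_id, pid = stream.readline().split()
    ok = verb == "QUARANTINE" and quarantine(job_id, int(pid))
    sock.sendall(b"OK\n" if ok else b"FAIL\n")

# Demonstration with a stand-in quarantine function.
handled = []

def fake_quarantine(job_id, pid):
    handled.append((job_id, pid))
    return True

client, server = socket.socketpair()
worker = threading.Thread(target=serve_one, args=(server, fake_quarantine))
worker.start()
granted = request_quarantine(client, pid=4242, job_id="1001.1")
worker.join()
client.close()
server.close()
```

Because the caller does not return until the reply arrives, the exec() call-through cannot happen before the pid is in its CGroup(s). A real daemon would presumably verify the requester's identity from the socket's peer credentials rather than trust the pid in the message.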

The majority of code that was written for our initial failed attempt at better CGroup integration is completely reusable. Gathering job resource information, creating CGroup containers, selecting optimal-placement core sets for a job, and listening for OOM notifications were all present; the only failure in the first design was the process of adding PIDs to the CGroup containers.
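
As an illustration of the kind of state the daemon can keep to avoid re-walking CGroup containers, the following sketch tracks which cores each job holds. The class is hypothetical and simply picks the lowest-numbered free cores. It does not attempt the optimal-placement logic mentioned above, which would also have to consider the processor topology.

```python
class CoreAllocator:
    """Remember per-job core sets so unused cores can be found without a rescan."""

    def __init__(self, all_cores):
        self.all_cores = set(all_cores)
        self.by_job = {}  # job id -> set of cores bound to that job

    def allocate(self, job_id, count):
        """Return count cores for job_id, or None if too few are free."""
        if job_id in self.by_job:  # repeated request for a known job
            return sorted(self.by_job[job_id])
        used = set().union(*self.by_job.values())
        free = sorted(self.all_cores - used)
        if len(free) < count:
            return None
        self.by_job[job_id] = set(free[:count])
        return free[:count]

    def release(self, job_id):
        """Forget a job's cores once its CGroup(s) have been destroyed."""
        self.by_job.pop(job_id, None)
```

If this in-memory table were lost (for example, after a daemon restart), it could be rebuilt by the slower walk of the existing CGroup containers.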

The expanded project was given the name Grid Engine Cgroup Orchestrator (GECO). The following schematic illustrates the major processes involved in GECO's handling of a job spanning two compute nodes:

Note that in this diagram a system's root secure shell daemon (sshd) also makes use of the LD_PRELOAD patches to the exec() functions. The LD_PRELOAD library contains logic to gate SSH access to compute nodes such that only SSH sessions carrying job credentials in their environment (and for the user who owns that job) will be permitted.
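
The gating decision itself is simple to state. The sketch below assumes the job credential is a job id carried in an environment variable (JOB_ID is used here only as a placeholder name) and that job ownership can be looked up through an injected function. Both are assumptions made for illustration rather than a description of the library's actual checks.

```python
def ssh_permitted(env, uid, owner_of, var="JOB_ID"):
    """Allow the session only if env names a job that uid owns.

    owner_of maps a job id to the uid of its owner and returns None for
    an unknown job.
    """
    job_id = env.get(var)
    if not job_id:
        return False  # no job credential in the session's environment
    return owner_of(job_id) == uid
```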

Connections to the kernel's out-of-memory notification facilities are also indicated.