Several variants of Grid Engine claim to support integration with Linux CGroups to control resource allocations. One such variant was chosen for the University's Farber cluster partly for the sake of this feature. We were deeply disappointed to eventually discover for ourselves that the marketing claims were true only in a very rudimentary sense. Some of the issues we uncovered were:
- The per-slot m_mem_free value is supposed to be multiplied by the number of slots the job occupies on the node. As implemented, the multiplication is applied twice and results in incorrect resource limits: e.g. when m_mem_free=2G was requested for a 20-core threaded job, the memory.limit_in_bytes applied by sge_execd to the job's shepherd was 80 GB (see the sketch below).
- Processes that enter a node via qrsh -inherit never have CGroup limits applied to them.
- Even if a CGroup is created for the qrsh -inherit process (e.g. in a prolog script), the sge_execd never adds the shepherd or its child processes to that CGroup.
- Core binding decisions are made by the qmaster; the sge_execd native sensors do not provide feedback w.r.t. what cores are available/unavailable. If cores on a node became unavailable, the qmaster would still attempt to use them as it selected cores for jobs, and the sge_execd would report that the job ended in error 1).
- An m_mem_free which is larger than the requested h_vmem value for the job is ignored and the h_vmem limit gets used for both. This is contrary to documentation 2).

The magnitude of the flaws and inconsistencies – and the fact that on our HPC cluster multi-node jobs form a significant percentage of the workload – meant that we could not make use of the native CGroup integration in this Grid Engine product, even though that feature was a primary reason for choosing the product.
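To make the intent of these items concrete, the following is a minimal sketch (not Grid Engine's or GECO's actual code) of the CGroup v1 operations the native integration was expected to perform for a job: create a memory subgroup, apply the slot-scaled limit exactly once, and attach the shepherd's PID. The /cgroup/memory mount point, the GECO/132.4 subgroup name, and the PID are illustrative assumptions, echoing the cpuset path cited later in this document.

```python
import os

# Illustrative CGroup v1 sketch; the mount point, subgroup name, and PID
# below are assumptions for the example, not values used by Grid Engine.
SUBGROUP = "/cgroup/memory/GECO/132.4"

def apply_memory_limit(pid, per_slot_bytes, slots):
    """Create the job's memory subgroup, apply the slot-scaled limit, and
    attach the given process to it."""
    os.makedirs(SUBGROUP, exist_ok=True)

    # The per-slot m_mem_free value should be scaled by the slot count
    # exactly once: 2 GiB x 20 slots -> 40 GiB, not the 80 GiB produced
    # by the double multiplication described above.
    with open(os.path.join(SUBGROUP, "memory.limit_in_bytes"), "w") as f:
        f.write(str(per_slot_bytes * slots))

    # Writing the PID to the tasks file is the step sge_execd never
    # performed for qrsh -inherit shepherds, even when the subgroup
    # already existed.
    with open(os.path.join(SUBGROUP, "tasks"), "w") as f:
        f.write(str(pid))

apply_memory_limit(pid=12345, per_slot_bytes=2 * 1024**3, slots=20)
```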
GECO itself ran into problems of its own. When job state is queried via qstat, a set of read threads in the queue master is consulted. The read threads return the last-coalesced snapshot of the cluster's job state; updates of this information are controlled by a locking mechanism that surrounds the periodic scheduling run, etc. This means that during a lengthy scheduling run, qstat won't actually return the real state information for a job that has started running. We had to introduce some complicated logic, as well as some sleep() calls, in order to fetch accurate job information. Even that wasn't enough, as we later found qstat returning only partially-accurate job information for large array jobs. A fix to this flaw would have been necessary, but no solution other than more complexity and more sleep() usage presented itself.
usage presented itself.gecod
program could not seem to reliably read cpuset.cpus
from Cgroup subgroups it created. A processor binding would be produced and successfully written to e.g. /cgroup/cpuset/GECO/132.4/cpuset.cpus
. When gecod
scheduled the next job it would read /cgroup/cpuset/GECO/132.4/cpuset.cpus
in an attempt to determine what processors were available to the new job being scheduled. However, when /cgroup/cpuset/GECO/132.4/cpuset.cpus
was opened and read no data was present.sleep()
delays around these reads, but to no consistently-reliable result.
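The retry wrapper amounted to something like the following sketch, which simply re-reads the file after a short pause; as noted, even this did not produce consistently reliable results.

```python
import time

def read_cpuset_cpus(path, max_tries=5, delay=0.5):
    """Read cpuset.cpus (e.g. /cgroup/cpuset/GECO/132.4/cpuset.cpus),
    retrying with a short sleep() whenever the file reads back empty."""
    for _ in range(max_tries):
        with open(path, "r") as f:
            cpus = f.read().strip()
        if cpus:
            return cpus      # e.g. "0-3,8-11"
        time.sleep(delay)    # empty read: pause and retry
    return None              # gave up; caller must treat the data as unknown
```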
GECO also needed to quarantine ssh access to compute nodes. When an sshd was started, the user id and the environment of the process were checked to determine whether the sshd should be killed or allowed to execute. If the sshd was owned by root, then by default nothing was done to impede or alter its startup. Unfortunately, this meant that qlogin sessions – which start an sshd under the sge-shepherd process for the job – were never getting quarantined properly. A fix for this would have been necessary for GECO to be fully functional.
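A corrected check would have had to consider more than the owning uid – for example, also inspecting the sshd's environment for Grid Engine job variables. The sketch below illustrates that idea; treating JOB_ID as the marker of a job-spawned sshd is an assumption for the example, not GECO's actual test.

```python
import os

def should_quarantine_sshd(pid):
    """Decide whether a newly started sshd should be quarantined, based on
    its owning uid and its environment as read from /proc."""
    uid = os.stat("/proc/%d" % pid).st_uid

    with open("/proc/%d/environ" % pid, "rb") as f:
        env = dict(entry.split(b"=", 1)
                   for entry in f.read().split(b"\0") if b"=" in entry)

    # A root-owned sshd with no job context is the system daemon: leave
    # it alone.  The flaw described above is that a qlogin's sshd is also
    # root-owned, so a uid-only test let it escape quarantine.
    if uid == 0 and b"JOB_ID" not in env:
        return False
    return True
```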
The sleep() blocks introduced to address issues with stale or unavailable data increased the time necessary for job quarantine to the point where, in many cases (during long scheduling runs), Grid Engine reached its threshold for awaiting sge-shepherd startup and simply marked the jobs as failed.

So while GECO worked quite well for us in small-scale testing on a subset of Farber's compute nodes, when scaled up to our actual workload it failed utterly: the complexity of adapting to that scale, under the conditions and with the software involved, proved insurmountable.