Full Support for Linux CGroups Using GECO
Several variants of Grid Engine claim to support integration with Linux CGroups to control resource allocations. One such variant was chosen for the University's Farber cluster partly for the sake of this feature. We were deeply disappointed to discover for ourselves that the marketing claims were true only in a very rudimentary sense. Some of the issues we identified were:
- The slot count multiplies many resource requests, like m_mem_free. As implemented, the multiplication is applied twice and results in incorrect resource limits: e.g. when requesting m_mem_free=2G for a 20-core threaded job, the memory.limit_in_bytes applied by sge_execd to the job's shepherd was 80 GB. (A minimal sketch of the intended single multiplication follows this list.)
- For multi-node parallel jobs, remote shepherd processes created upon qrsh -inherit to a node never have CGroup limits applied to them.
- Even if we created the appropriate CGroups on the remote nodes prior to qrsh -inherit (e.g. in a prolog script), the sge_execd never adds the shepherd or its child processes to that CGroup.
- Grid Engine core binding assumes that all core usage is controlled by the qmaster; the sge_execd native sensors do not provide feedback w.r.t. which cores are available or unavailable. If a compute node were configured with some cores reserved for non-GE workload, the qmaster would still attempt to use them.
- Allowing Grid Engine to use the cpuset CGroup to perform core binding works on the master node for the job, but slave nodes have no core binding applied to them even though the qmaster selected cores for them.
- Using the CGroup freezer to clean up after jobs always results in the shepherd's being killed before its child processes, which signals to sge_execd that the job ended in error1).
- A bug exists by which a requested m_mem_free that is larger than the requested h_vmem value for the job is ignored, and the h_vmem limit gets used for both. This is contrary to documentation2).
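The intent, as we understand it, is simple: multiply the per-slot m_mem_free request by the slot count exactly once, write that limit into the job's memory controller group, and place the shepherd in that group. The sketch below illustrates that single multiplication; the /cgroup/memory mount point, the GECO/<job_id> group name, and the helper itself are illustrative assumptions (modelled on the cpuset paths further below), not Grid Engine or GECO code.

```
#!/usr/bin/env python
# Minimal sketch of a correct per-job memory limit using the cgroup v1 memory
# controller.  The mount point and GECO/<job_id> group name are assumptions.
import os

CGROUP_MEMORY_ROOT = "/cgroup/memory/GECO"   # assumed hierarchy; per-job groups below it

def apply_memory_limit(job_id, shepherd_pid, m_mem_free_bytes, slots):
    """Create the job's memory group, set its limit, and add the shepherd to it."""
    group = os.path.join(CGROUP_MEMORY_ROOT, job_id)
    os.makedirs(group)

    # The per-slot request is multiplied by the slot count exactly once.
    with open(os.path.join(group, "memory.limit_in_bytes"), "w") as f:
        f.write(str(m_mem_free_bytes * slots))

    # Writing the shepherd's PID to tasks makes the limit apply to the job's
    # process tree (children inherit the group).
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write(str(shepherd_pid))

# e.g. m_mem_free=2G on 20 slots for a hypothetical job 132.4, shepherd PID 12345
apply_memory_limit("132.4", 12345, 2 * 1024**3, 20)
```

With m_mem_free=2G and 20 slots this writes a 40 GB limit; as noted above, the limit actually applied by sge_execd was 80 GB.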
The magnitude of the flaws and inconsistencies – and the fact that on our HPC cluster multi-node jobs form a significant percentage of the workload – meant that we could not make use of the native CGroup integration in this Grid Engine product, even though it was a primary reason for choosing the product.
Issues that Led to Termination of this Project
- The Linux kernel being used on the compute nodes is 2.6.32-504.30.3.el6.x86_64. There is a known bug in this kernel which manifests under heavy use of the CGroup facilities; this bug actually crashed the Farber head node the day after our 2016-01-06 maintenance and required that we revert the head node to 2.6.32-504.16.2.el6.x86_64. With GECO running, compute nodes were hitting this same kernel bug at a high enough rate that we either had to cease using GECO or attempt to roll back the kernel in the compute nodes' boot images.
- The Grid Engine epilog script was responsible for cleanup of the CGroup environment for the job. As early as June 2015 we had suspicions that Grid Engine wasn't always executing the epilog script, but the vendor assured us that this could not happen. However, it does happen: if the user removes or renames the working directory of a running job, Grid Engine will not run the epilog script and CGroup cleanup never happens (the first sketch after this list illustrates the cleanup involved). Orphaned cpuset CGroups meant that the cores allocated to that job could not be bound to other jobs, and those subsequent jobs would fail to start.
- With the release of Grid Engine that we are using, the vendor introduced a mechanism to support faster delivery of job information. When doing a qstat, a set of read threads in the queue master is queried. The read threads return the last-coalesced snapshot of the cluster's job state; updates to this information are controlled by a locking mechanism that surrounds the periodic scheduling run, etc. This means that during a lengthy scheduling run, qstat won't actually return the real state information for a job that has already started running. We had to introduce some complicated logic as well as some sleep() calls in order to fetch accurate job information. Even that wasn't enough, as we later found qstat returning partially-accurate job information for large array jobs. A fix to this flaw would have been necessary, but no solution other than more complexity and more sleep() usage presented itself.
- The gecod program could not reliably read cpuset.cpus from CGroup subgroups it created. A processor binding would be produced and successfully written to e.g. /cgroup/cpuset/GECO/132.4/cpuset.cpus. When gecod scheduled the next job it would read /cgroup/cpuset/GECO/132.4/cpuset.cpus in an attempt to determine what processors were available to the new job being scheduled. However, when /cgroup/cpuset/GECO/132.4/cpuset.cpus was opened and read, no data was present. We added sleep() delays around these reads (the second sketch after this list shows the pattern), but to no consistently reliable result.
- The GECO mechanism was also being used to gate ssh access to compute nodes. When an sshd was started, the user id and the environment of the process were checked to determine whether the sshd should be killed or allowed to execute (the third sketch after this list illustrates this check). If the sshd were owned by root, then by default nothing was done to impede or alter its startup. Unfortunately, this meant that qlogin sessions – which start an sshd under the sge-shepherd process for the job – were never getting quarantined properly. A fix for this would have been necessary for GECO to be fully functional.
- All of the sleep() blocks introduced to address issues with stale/unavailable data increased the time necessary for job quarantine to the point where, in many cases (during long scheduling runs), Grid Engine reached its threshold for awaiting sge-shepherd startup and would simply mark the jobs as failed.
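Regarding the epilog item above: the cleanup that gets skipped amounts to emptying and removing the job's cpuset group so its cores become bindable again. Below is a minimal sketch of that idea, assuming the /cgroup/cpuset/GECO/<job_id> layout cited above; the helper and its error handling are illustrative, not GECO's actual epilog logic.

```
#!/usr/bin/env python
# Minimal sketch of the CGroup cleanup the epilog was responsible for:
# empty and remove the job's cpuset group so its cores can be re-bound.
# Paths follow the /cgroup/cpuset/GECO/<job_id> layout cited above.
import os

CPUSET_ROOT = "/cgroup/cpuset"

def cleanup_cpuset(job_id):
    group = os.path.join(CPUSET_ROOT, "GECO", job_id)
    if not os.path.isdir(group):
        return

    # Any task still in the group must be moved back to the root cpuset,
    # otherwise rmdir() fails with EBUSY and the group is orphaned.
    with open(os.path.join(group, "tasks")) as f:
        leftover = f.read().split()
    for pid in leftover:
        try:
            with open(os.path.join(CPUSET_ROOT, "tasks"), "w") as f:
                f.write(pid)
        except IOError:
            pass   # the task may have exited already

    # Removing the now-empty group frees its cpuset.cpus for other jobs.
    os.rmdir(group)

cleanup_cpuset("132.4")
```

If the epilog never runs, nothing performs these steps, and the orphaned group keeps its cores out of reach of subsequent jobs.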
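The workaround for the empty cpuset.cpus reads amounted to retrying with short sleeps, roughly as sketched below. The retry count, delay, and helper name are illustrative assumptions; as noted above, even this pattern did not yield consistently reliable results.

```
#!/usr/bin/env python
# Sketch of the retry-with-sleep pattern attempted around gecod's reads of
# cpuset.cpus.  The retry count and delay are illustrative, not gecod's values.
import time

def read_cpuset_cpus(path, attempts=5, delay=0.2):
    """Return the CPU list from a cpuset.cpus file, retrying on empty reads."""
    for _ in range(attempts):
        with open(path) as f:
            data = f.read().strip()
        if data:
            return data        # e.g. "0-3,8-11"
        time.sleep(delay)      # these sleeps are what inflated quarantine time
    raise RuntimeError("no data in %s after %d attempts" % (path, attempts))

# cpus = read_cpuset_cpus("/cgroup/cpuset/GECO/132.4/cpuset.cpus")
```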
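Finally, the sshd gating described above boils down to inspecting a new process's owner and environment and deciding whether to kill it, quarantine it into the job's CGroups, or leave it alone. The sketch below shows the shape of that check; the helper name, the return values, and the JOB_ID lookup in /proc/<pid>/environ are assumptions, while the root-owned bypass and its qlogin consequence come from the behaviour described above.

```
#!/usr/bin/env python
# Sketch of a uid/environment check for gating a newly started sshd.
# The decision values and the JOB_ID lookup are illustrative assumptions.
import os

def sshd_disposition(pid):
    """Return 'allow', 'quarantine', or 'kill' for a newly started sshd."""
    uid = os.stat("/proc/%d" % pid).st_uid

    # /proc/<pid>/environ is NUL-separated KEY=VALUE pairs.
    with open("/proc/%d/environ" % pid, "rb") as f:
        env = dict(item.split("=", 1)
                   for item in f.read().decode("utf-8", "replace").split("\0")
                   if "=" in item)

    if uid == 0:
        # A root-owned sshd is left alone by default -- which is why the sshd
        # started for qlogin under sge-shepherd was never quarantined.
        return "allow"
    if "JOB_ID" in env:
        return "quarantine"    # attach the process to the job's CGroups
    return "kill"              # ssh session with no associated job
```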
So while GECO worked quite well for us in small-scale testing on a subset of Farber's compute nodes, when scaled up to our actual workload it failed utterly: the complexity of adapting to that scale, given the conditions and software involved, proved insurmountable.