====== Full Support for Linux CGroups Using GECO ======

Several variants of Grid Engine claim to support integration with Linux cgroups to control resource allocations. One such variant was chosen for the University's Farber cluster partly for the sake of this feature. We were incredibly disappointed to eventually discover for ourselves that the marketing hype was true only in a very rudimentary sense. Some of the issues we uncovered were:

  * The slot count multiplies many resource requests, such as ''m_mem_free''. As implemented, the multiplication is applied twice and produces incorrect resource limits: for example, when ''m_mem_free=2G'' was requested for a 20-core threaded job, the **memory.limit_in_bytes** applied by ''sge_execd'' to the job's shepherd was 80 GB.
  * For multi-node parallel jobs, the remote shepherd processes created by ''qrsh -inherit'' on a node never have cgroup limits applied to them (the sketch after this list illustrates the kind of per-job setup involved).
  * Even if we created the appropriate cgroups on the remote nodes prior to ''qrsh -inherit'' (e.g. in a prolog script), ''sge_execd'' never adds the shepherd or its child processes to that cgroup.
  * Grid Engine core binding assumes that all core usage is controlled by the ''qmaster''; the ''sge_execd'' native sensors provide no feedback about which cores are available or unavailable.
    * If a compute node were configured with some cores reserved for non-GE workloads, the ''qmaster'' would still attempt to use them.
  * Allowing Grid Engine to use the **cpuset** cgroup to perform core binding works on the job's master node, but slave nodes have no core binding applied to them even though the ''qmaster'' selected cores for them.
  * Using the cgroup **freezer** to clean up after jobs always results in the shepherd being killed before its child processes, which signals to ''sge_execd'' that the job ended in error((This issue was fixed in a subsequent release of the Grid Engine in question.)).
  * A bug exists whereby a requested ''m_mem_free'' larger than the job's requested ''h_vmem'' is ignored and the ''h_vmem'' limit is used for both, contrary to the documentation((This issue was fixed in a subsequent release of the Grid Engine in question.)).
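
To make the gap concrete: the per-job setup that never happened on the slave nodes amounts to creating **memory** and **cpuset** subgroups for the job and placing the shepherd's PID in them. The following is a minimal, illustrative sketch only (not GECO or Grid Engine code), assuming cgroup v1 controllers mounted under ''/cgroup'' and the ''GECO/<job>.<task>'' naming used elsewhere on this page; the helper name, its parameters, and the single-NUMA-node simplification are hypothetical.

<code python>
import os

CGROUP_ROOT = "/cgroup"   # cgroup v1 mount point (assumption)
PARENT = "GECO"           # per-job groups live under <controller>/GECO/

def setup_job_cgroup(job_id, task_id, shepherd_pid, mem_bytes, cpu_list):
    """Create memory and cpuset subgroups for one job task and place the
    shepherd PID in them so that its children inherit the limits."""
    name = "%d.%d" % (job_id, task_id)    # e.g. "132.4"

    # Memory controller: cap the job task at mem_bytes.
    mem_dir = os.path.join(CGROUP_ROOT, "memory", PARENT, name)
    if not os.path.isdir(mem_dir):
        os.makedirs(mem_dir)
    with open(os.path.join(mem_dir, "memory.limit_in_bytes"), "w") as f:
        f.write(str(mem_bytes))

    # Cpuset controller: bind the job task to the cores chosen for it.
    # Both cpuset.cpus and cpuset.mems must be set before tasks are added.
    cpu_dir = os.path.join(CGROUP_ROOT, "cpuset", PARENT, name)
    if not os.path.isdir(cpu_dir):
        os.makedirs(cpu_dir)
    with open(os.path.join(cpu_dir, "cpuset.cpus"), "w") as f:
        f.write(cpu_list)                 # e.g. "0-4,8"
    with open(os.path.join(cpu_dir, "cpuset.mems"), "w") as f:
        f.write("0")                      # simplification: NUMA node 0 only

    # Writing the shepherd's PID to "tasks" applies the limits to it and to
    # every process it subsequently forks.
    for d in (mem_dir, cpu_dir):
        with open(os.path.join(d, "tasks"), "w") as f:
            f.write(str(shepherd_pid))
</code>

Because cgroup membership is inherited on ''fork()'', any process the shepherd subsequently starts (the job script, ''qrsh -inherit'' tasks, and so on) is covered by the same limits.
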
The magnitude of the flaws and inconsistencies, combined with the fact that multi-node jobs form a significant fraction of the workload on our HPC cluster, meant that we could not make use of the native cgroup integration in this Grid Engine product, even though it was a primary reason for choosing the product.

As of 2016-01-15 the GECO project has ceased development. In each phase of testing we uncovered more and more inconsistencies (and sometimes outright flaws) in the Grid Engine and Linux kernel implementations in use.

===== Issues that Led to Termination of this Project =====

  * The Linux kernel on the compute nodes is 2.6.32-504.30.3.el6.x86_64. There is a [[https://access.redhat.com/solutions/1489713|known bug]] in this kernel which manifests under heavy use of the cgroup facilities; this bug actually crashed the Farber head node the day after our 2016-01-06 maintenance and required that we revert the head node to 2.6.32-504.16.2.el6.x86_64. With GECO running, compute nodes were hitting this same kernel bug at a high enough rate that we either had to stop using GECO or attempt to roll back the kernel in the compute nodes' boot images.
  * The Grid Engine epilog script was responsible for cleaning up the cgroup environment for the job. As early as June 2015 we suspected that Grid Engine wasn't **always** executing the epilog script, but the vendor assured us that could not happen. However, it does happen: if the user removes or renames the working directory of a running job, Grid Engine will not run the epilog script, and cgroup cleanup does not occur. Orphaned cpuset cgroups meant that the cores allocated to that job could not be bound to other jobs, and those subsequent jobs would fail to start.
  * With the release of Grid Engine that we are using, the vendor introduced a mechanism to deliver job information faster: a ''qstat'' queries a set of //read threads// in the queue master. The read threads return the last-coalesced snapshot of the cluster's job state; updates of this information are controlled by a locking mechanism that surrounds the periodic scheduling run. This means that during a lengthy scheduling run, ''qstat'' will not return the real state of a job that has just started running. We had to introduce some complicated logic, as well as some ''sleep()'' calls, in order to fetch accurate job information. Even that was not enough, as we later found ''qstat'' returning only partially accurate job information for large array jobs. A fix for this flaw would have been necessary, but no solution other than more complexity and more ''sleep()'' usage presented itself.
  * The ''gecod'' program could not reliably read ''cpuset.cpus'' from cgroup subgroups it had created. A processor binding would be produced and successfully written to e.g. ''/cgroup/cpuset/GECO/132.4/cpuset.cpus''. When ''gecod'' scheduled the next job, it would read ''/cgroup/cpuset/GECO/132.4/cpuset.cpus'' to determine which processors were available to the new job. However, when that file was opened and read, no data was present. During the week of 2016-01-11 we experimented with adding various logic and ''sleep()'' delays around these reads (a sketch of this retry approach appears at the end of this page), but without consistently reliable results.
  * The GECO mechanism was also used to gate ''ssh'' access to compute nodes. When an ''sshd'' was started, the user id and the environment of the process were checked to determine whether the ''sshd'' should be killed or allowed to execute. If the ''sshd'' was owned by ''root'', then by default nothing was done to impede or alter its startup. Unfortunately, this meant that ''qlogin'' sessions (which start an ''sshd'' under the job's ''sge-shepherd'' process) were never quarantined properly. A fix for this would have been necessary for GECO to be fully functional.
  * All of the ''sleep()'' blocks introduced to work around stale or unavailable data increased the time needed to quarantine a job to the point where, in many cases (during long scheduling runs), Grid Engine reached its threshold for awaiting ''sge-shepherd'' startup and simply marked the jobs as failed.

So while GECO worked quite well for us in small-scale testing on a subset of Farber's compute nodes, when scaled up to our actual workload it failed utterly: the complexity of adapting to that scale, given the conditions and software involved, proved insurmountable.
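
As a footnote to the ''cpuset.cpus'' issue described above: the retry logic we experimented with during the week of 2016-01-11 amounted to something like the sketch below. This is illustrative only, with hypothetical names; the actual ''gecod'' code differed, and even this approach did not yield consistently reliable results.

<code python>
import time

def read_cpuset_cpus(path, retries=5, delay=0.5):
    """Return the CPU list string from a cpuset.cpus file, retrying with a
    short sleep if the file reads back empty (as we observed on Farber)."""
    for attempt in range(retries):
        try:
            with open(path) as f:
                value = f.read().strip()
        except IOError:
            value = ""
        if value:
            return value          # e.g. "0-4,8"
        time.sleep(delay)         # give the cgroup filesystem time to settle
    raise RuntimeError("no data in %s after %d attempts" % (path, retries))

# Example: which cores does job 132, task 4 currently own?
# cpus = read_cpuset_cpus("/cgroup/cpuset/GECO/132.4/cpuset.cpus")
</code>
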