Several variants of Grid Engine claim to support integration with Linux CGroups to control resource allocations. One such variant was chosen for the University's Farber cluster partly for the sake of this feature. We were deeply disappointed to eventually discover for ourselves that the marketing claims were true only in a very rudimentary sense. Some of the issues we uncovered were:
- The per-slot m_mem_free value is supposed to be multiplied by the number of slots the job occupies on the node. As implemented, the multiplication is applied twice and results in incorrect resource limits: e.g. when m_mem_free=2G was requested for a 20-core threaded job, the memory.limit_in_bytes applied by sge_execd to the job's shepherd was 80 GB (see the sketch below).
- Processes that enter a node via qrsh -inherit never have CGroup limits applied to them.
- Even if a CGroup is created for the qrsh -inherit process (e.g. in a prolog script), the sge_execd never adds the shepherd or its child processes to that CGroup.
- Core binding decisions are made by the qmaster; the sge_execd native sensors do not provide feedback w.r.t. what cores are available/unavailable. If cores on a node became unavailable, the qmaster would still attempt to use them as it selected cores for jobs, and the sge_execd would report that the job ended in error 1).
- An m_mem_free which is larger than the requested h_vmem value for the job is ignored and the h_vmem limit gets used for both. This is contrary to documentation 2).

The magnitude of the flaws and inconsistencies – and the fact that on our HPC cluster multi-node jobs form a significant percentage of the workload – meant that we could not make use of the native CGroup integration in this Grid Engine product, even though that feature was a primary reason for choosing the product.
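To make the intent of these items concrete, the following is a minimal sketch (not Grid Engine's or GECO's actual code) of the CGroup v1 operations the native integration was expected to perform for a job: create a memory subgroup, apply the slot-scaled limit exactly once, and attach the shepherd's PID. The /cgroup/memory mount point, the GECO/132.4 subgroup name, and the PID are illustrative assumptions, echoing the cpuset path cited later in this document.

```python
import os

# Illustrative CGroup v1 sketch; the mount point, subgroup name, and PID
# below are assumptions for the example, not values used by Grid Engine.
SUBGROUP = "/cgroup/memory/GECO/132.4"

def apply_memory_limit(pid, per_slot_bytes, slots):
    """Create the job's memory subgroup, apply the slot-scaled limit, and
    attach the given process to it."""
    os.makedirs(SUBGROUP, exist_ok=True)

    # The per-slot m_mem_free value should be scaled by the slot count
    # exactly once: 2 GiB x 20 slots -> 40 GiB, not the 80 GiB produced
    # by the double multiplication described above.
    with open(os.path.join(SUBGROUP, "memory.limit_in_bytes"), "w") as f:
        f.write(str(per_slot_bytes * slots))

    # Writing the PID to the tasks file is the step sge_execd never
    # performed for qrsh -inherit shepherds, even when the subgroup
    # already existed.
    with open(os.path.join(SUBGROUP, "tasks"), "w") as f:
        f.write(str(pid))

apply_memory_limit(pid=12345, per_slot_bytes=2 * 1024**3, slots=20)
```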
GECO itself ran into problems of its own. When job state is queried via qstat, a set of read threads in the queue master is consulted. The read threads return the last-coalesced snapshot of the cluster's job state; updates of this information are controlled by a locking mechanism that surrounds the periodic scheduling run, etc. This means that during a lengthy scheduling run, qstat won't actually return the real state information for a job that has started running. We had to introduce some complicated logic, as well as some sleep() calls, in order to fetch accurate job information. Even that wasn't enough, as we later found qstat returning only partially-accurate job information for large array jobs. A fix to this flaw would have been necessary, but no solution other than more complexity and more sleep() usage presented itself.
usage presented itself.gecod
program could not seem to reliably read cpuset.cpus
from Cgroup subgroups it created. A processor binding would be produced and successfully written to e.g. /cgroup/cpuset/GECO/132.4/cpuset.cpus
. When gecod
scheduled the next job it would read /cgroup/cpuset/GECO/132.4/cpuset.cpus
in an attempt to determine what processors were available to the new job being scheduled. However, when /cgroup/cpuset/GECO/132.4/cpuset.cpus
was opened and read no data was present.sleep()
delays around these reads, but to no consistently-reliable result.
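The retry wrapper amounted to something like the following sketch, which simply re-reads the file after a short pause; as noted, even this did not produce consistently reliable results.

```python
import time

def read_cpuset_cpus(path, max_tries=5, delay=0.5):
    """Read cpuset.cpus (e.g. /cgroup/cpuset/GECO/132.4/cpuset.cpus),
    retrying with a short sleep() whenever the file reads back empty."""
    for _ in range(max_tries):
        with open(path, "r") as f:
            cpus = f.read().strip()
        if cpus:
            return cpus      # e.g. "0-3,8-11"
        time.sleep(delay)    # empty read: pause and retry
    return None              # gave up; caller must treat the data as unknown
```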
GECO also needed to quarantine ssh access to compute nodes. When an sshd was started, the user id and the environment of the process were checked to determine whether the sshd should be killed or allowed to execute. If the sshd was owned by root, then by default nothing was done to impede or alter its startup. Unfortunately, this meant that qlogin sessions – which start an sshd under the sge-shepherd process for the job – were never getting quarantined properly. A fix for this would have been necessary for GECO to be fully functional.
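A corrected check would have had to consider more than the owning uid – for example, also inspecting the sshd's environment for Grid Engine job variables. The sketch below illustrates that idea; treating JOB_ID as the marker of a job-spawned sshd is an assumption for the example, not GECO's actual test.

```python
import os

def should_quarantine_sshd(pid):
    """Decide whether a newly started sshd should be quarantined, based on
    its owning uid and its environment as read from /proc."""
    uid = os.stat("/proc/%d" % pid).st_uid

    with open("/proc/%d/environ" % pid, "rb") as f:
        env = dict(entry.split(b"=", 1)
                   for entry in f.read().split(b"\0") if b"=" in entry)

    # A root-owned sshd with no job context is the system daemon: leave
    # it alone.  The flaw described above is that a qlogin's sshd is also
    # root-owned, so a uid-only test let it escape quarantine.
    if uid == 0 and b"JOB_ID" not in env:
        return False
    return True
```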
The sleep() blocks introduced to address issues with stale or unavailable data increased the time necessary for job quarantine to the point where, in many cases (during long scheduling runs), Grid Engine reached its threshold for awaiting sge-shepherd startup and simply marked the jobs as failed.

So while GECO worked quite well for us in small-scale testing on a subset of Farber's compute nodes, when scaled up to our actual workload it failed utterly: the complexity of adapting to that scale, under the conditions and with the software involved, proved insurmountable.