Full Support for Linux CGroups Using GECO

Several variants of Grid Engine claim to support integration with Linux CGroups to control resource allocations. One such variant was chosen for the University's Farber cluster partly for the sake of this feature. We were incredibly disappointed to discover for ourselves that the marketing claims were true only in a very rudimentary sense. Some of the issues we uncovered were:

The magnitude of the flaws and inconsistencies – and the fact that on our HPC cluster multi-node jobs form a significant percentage of the workload – meant that we could not make use of the native CGroup integration in this Grid Engine product, even though it was a primary reason for choosing the product.

As of 2016-01-15, development of the GECO project has ceased. In each phase of testing we uncovered more and more inconsistencies (and sometimes outright flaws) in the Grid Engine and Linux kernel implementations being used.

Issues that Led to Termination of this Project

We experimented the week of 2016-01-11 with adding various logic and sleep() delays around these reads, but to no consistently-reliable result.
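For readers unfamiliar with the approach, the kind of "retry with sleep() delays" logic described above can be sketched roughly as follows. This is only an illustration and not GECO's actual code: the function name `read_with_retries`, its parameters, and the choice of Python are all assumptions made for the example, and the file being read is left to the caller.

```python
import time


def read_with_retries(path, attempts=5, delay=0.1):
    """Try to read `path`, sleeping between failed attempts.

    Hypothetical helper for illustration only. Returns the file
    contents as a string, or None if every attempt failed.
    """
    for attempt in range(attempts):
        try:
            with open(path, "r") as handle:
                return handle.read()
        except OSError:
            # The file may not exist yet or may be momentarily unreadable;
            # wait, then try again unless this was the last attempt.
            if attempt < attempts - 1:
                time.sleep(delay)
    return None
```

A delay of this sort only narrows the window in which a read can fail; it does not remove the underlying race, which is consistent with our finding that no combination of logic and delays gave a reliable result.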

So while GECO worked quite well for us in small-scale testing on a subset of Farber's compute nodes, it failed utterly when scaled up to our actual workload: the complexity of adapting to that scale, under the conditions and with the software involved, proved insurmountable.

1), 2) This issue was fixed in a subsequent release of the Grid Engine in question.