====== Grid Engine: Idle Queue Implementation using Subordination ======

One feature of Grid Engine I've only now discovered and tested is //queue subordination//. All of the clusters I've managed have, in retrospect, used a flat set of queues differentiated merely by the hosts on which instances were created or by the parallel environments available. Recently I experimented (successfully) with leveraging the queue sequence number((''seq_no'' attribute in a queue's configuration)) and slot-limiting //resource quotas// to partition the full complement of nodes equally between threaded and distributed parallel programs.

A proposed feature of the new UD community cluster is the sharing of idle cores, with those cores' owner(s) granted preferential access at any time. So while Professor X is writing his latest paper and not using his cores, Magneto can submit a job targeting idle resources and utilize those cores. However, if Professor X needs to rerun a few computations to satisfy a peer reviewer of his paper, Magneto's jobs will be killed((in time we hope to have process checkpointing implemented for proper suspension/migration rather than outright killing of idle processes)) to make cores available for Professor X.

Under this scheme, an //idle queue// spans all cluster resources, while one or more //owner queues// apply to the specific nodes purchased by an investing entity. The //idle queue// is configured to be //subordinate// to all //owner queues//. The subordination is defined on a per-host basis, with a threshold indicating at what point idle jobs must make way for owner jobs:

<code>
qname:        profx_3day.q
hostlist:     @profx.hosts
   :
subordinate:  slots=NONE,[@profx.hosts=slots=2(idle.q:0:sr)]
   :
</code>

This ''subordinate'' directive states the following: when

  - the total number of slots in use across both queues on a host (in the ''@profx.hosts'' host list) is greater than 2((assuming Professor X has ancient two-processor, single-core nodes)),
  - any of those slots are occupied by ''idle.q'', and
  - a job is eligible to run in ''profx_3day.q'' and requires no more slots than are currently occupied by ''idle.q'',

then Grid Engine begins suspending jobs running on that host via ''idle.q'', starting with the job with the shortest accumulated runtime.

By default, Grid Engine suspends a task by sending it the ''SIGSTOP'' signal; the task can later resume execution by means of ''SIGCONT''. This scheme does not evict the task from memory on the execution host and will not work properly for distributed parallel programs. It also precludes any possibility of "migrating" the evicted task(s) onto other available resources.

===== An Example =====

I set up two test queues on Strauss: ''3day.q'' stands in for an owner queue, and ''idle.q'' is the all-access idle queue. Both queues are present on a single host and have two execution slots. Suppose a single core is in use by Professor X and a single core by Magneto:

<code>
[frey@strauss ~]$ qstat
job-ID  prior    name   user      state  submit/start at      queue                     slots  ja-task-ID
----------------------------------------------------------------------------------------------------------
    34  0.55500  x.qs   magneto   r      11/02/2011 15:01:34  idle.q@strauss.udel.edu       1
    36  0.55500  a.qs   profx     r      11/02/2011 15:01:49  3day.q@strauss.udel.edu       1
</code>
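The ''.qs'' job scripts themselves aren't shown here; for this walk-through they can be assumed to be trivial serial jobs that simply hold a slot long enough for the queue states to be observed. A minimal sketch follows (the submission directives and the sleep duration are illustrative assumptions, not the actual scripts):

<code>
#!/bin/bash
#$ -S /bin/bash   # run the job under bash
#$ -cwd           # execute in the submission directory
#$ -j y           # merge stderr into stdout

# Hold the execution slot for an hour so the queue states can be observed.
sleep 3600
</code>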
Magneto is hungry for CPU time, so he submits two additional jobs to the idle queue:

<code>
[magneto@strauss ~]$ qsub -q idle.q y.qs
Your job 37 ("y.qs") has been submitted
[magneto@strauss ~]$ qsub -q idle.q z.qs
Your job 38 ("z.qs") has been submitted
[magneto@strauss ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
3day.q@strauss.udel.edu        BIP   0/1/2          0.54     sol-sparc64
     36 0.55500 a.qs       profx        r     11/02/2011 15:01:49     1
---------------------------------------------------------------------------------
idle.q@strauss.udel.edu        BIP   0/1/2          0.54     sol-sparc64   P

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     34 0.55500 x.qs       magneto      r     11/02/2011 15:01:34     1
     37 0.55500 y.qs       magneto      qw    11/02/2011 15:02:02     1
     38 0.55500 z.qs       magneto      qw    11/02/2011 15:02:03     1
</code>

Oops, the pending-job placement above is mine; the running job 34 belongs under ''idle.q'':

<code>
idle.q@strauss.udel.edu        BIP   0/1/2          0.54     sol-sparc64   P
     34 0.55500 x.qs       magneto      r     11/02/2011 15:01:34     1
</code>

The ''idle.q'' instance now shows the ''P'' state, indicating overload. This state is produced by the ''subordinate'' clause added to the configuration of ''3day.q'': adding another job to the idle queue instance on this host would exceed the threshold, so Magneto's new jobs must wait.

Suddenly, Professor X finds that the input to one of his tasks was incorrect, and he must recalculate one figure for his paper. He submits a job:

<code>
[profx@strauss ~]$ qsub -q 3day.q b.qs
Your job 39 ("b.qs") has been submitted
[profx@strauss ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
3day.q@strauss.udel.edu        BIP   0/2/2          0.49     sol-sparc64
     36 0.55500 a.qs       profx        r     11/02/2011 15:01:49     1
     39 0.55500 b.qs       profx        r     11/02/2011 15:04:19     1
---------------------------------------------------------------------------------
idle.q@strauss.udel.edu        BIP   0/0/2          0.49     sol-sparc64   P

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     37 0.55500 y.qs       magneto      qw    11/02/2011 15:02:02     1
     38 0.55500 z.qs       magneto      qw    11/02/2011 15:02:03     1
</code>

Ah! Magneto's ''x.qs'' job has been evicted from ''idle.q'' on the host. Since ''idle.q'' was reconfigured to send ''SIGKILL'' instead of ''SIGSTOP'', the offending job was terminated outright to make room for the owner's work.
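The reconfiguration mentioned above amounts to overriding the default suspend method of ''idle.q''. A minimal sketch of how that might be done, assuming the change is made by a Grid Engine manager (the one-line ''qconf -mattr'' form is one option; editing the queue interactively with ''qconf -mq idle.q'' works equally well):

<code>
# Replace the default SIGSTOP suspension signal with SIGKILL for idle.q.
qconf -mattr queue suspend_method SIGKILL idle.q

# Confirm the change took effect.
qconf -sq idle.q | grep suspend_method
</code>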
We fast-forward several hours, and Professor X's ''a.qs'' job has completed:

<code>
[frey@strauss ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
3day.q@strauss.udel.edu        BIP   0/1/2          0.48     sol-sparc64
     39 0.55500 b.qs       profx        r     11/02/2011 15:04:19     1
---------------------------------------------------------------------------------
idle.q@strauss.udel.edu        BIP   0/1/2          0.48     sol-sparc64   P
     37 0.55500 y.qs       magneto      r     11/02/2011 15:05:19     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     38 0.55500 z.qs       magneto      qw    11/02/2011 15:02:03     1
</code>

This has opened up a slot in ''idle.q'', which the waiting ''y.qs'' job consumes. Once the other job owned by Professor X completes:

<code>
[frey@strauss ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
3day.q@strauss.udel.edu        BIP   0/0/2          0.46     sol-sparc64
---------------------------------------------------------------------------------
idle.q@strauss.udel.edu        BIP   0/2/2          0.46     sol-sparc64   P
     37 0.55500 y.qs       magneto      r     11/02/2011 15:05:19     1
     38 0.55500 z.qs       magneto      r     11/02/2011 15:06:04     1
</code>

the idle queue can be fully utilized by Magneto.

===== Next Steps =====

It is not immediately clear what interplay will manifest between slot-based resource quotas and subordination. Likewise, the subordination threshold should be summed across all //N// owner queues present on a host, where //N// is generally greater than one. The behavior of 3-day, 1-day, and indeterminate-length queues under idle subordination also needs careful testing.
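For concreteness, the slot-based resource quotas in question are rules of roughly the following shape (a hypothetical rule set; the name and the 64-slot cap are illustrative, not values from our configuration). Whether and how such a rule interacts with the subordination threshold is precisely what needs testing:

<code>
{
   name         idle_per_user
   description  "Hypothetical cap on any single user's footprint in idle.q"
   enabled      TRUE
   limit        users {*} queues idle.q to slots=64
}
</code>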