Grid Engine: Idle Queue Implementation using Subordination

One feature of Grid Engine I've only now discovered and tested is queue subordination. All existing clusters I've managed have used (in retrospect) a flat set of queues that were differentiated merely by the hosts on which instances were created or by available parallel environments. Recently I experimented (successfully) with leveraging the queue sequence number [1] and slot-limiting resource quotas to partition the full complement of nodes equally between threaded and distributed parallel programs.
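
For the curious, that partitioning hinged on two pieces of configuration, sketched below with hypothetical queue names, slot counts, and quota set name rather than the real ones:

# scheduler configuration (qconf -msconf): order queue instances by seq_no
queue_sort_method     seqno

# each queue carries a sequence number, e.g. seq_no 10 in threads.q and
# seq_no 20 in mpi.q (set via qconf -mq <queue>)

# resource quota set (qconf -arqs) capping the slots each queue may consume
{
   name         half_and_half
   description  split the cluster between threaded and MPI workloads
   enabled      TRUE
   limit        queues threads.q to slots=512
   limit        queues mpi.q to slots=512
}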

A proposed feature of the new UD community cluster is the sharing of idle cores, with those cores' owner(s) granted preferential access at any time. So while Professor X is writing his latest paper and not using his cores, Magneto can submit a job targeting idle resources and utilize those cores. However, if Professor X needs to rerun a few computations to satisfy a peer reviewer of his paper, Magneto's jobs will be killed [2] to make cores available for Professor X.

Under this scheme, an idle queue spans all cluster resources, while one or more owner queues apply to specific nodes purchased by an investing entity. The idle queue is configured to be subordinate to all owner queues. The subordination is defined on a per-host basis, with a threshold indicating at what point idle jobs must make way for owner jobs:

qname:               profx_3day.q
hostlist:            @profx.hosts
  :
subordinate_list:    slots=NONE,[@profx.hosts=slots=2(idle.q:0:sr)]
  :

This subordinate_list directive states that if all of the following hold on a host in the @profx.hosts host list:

  1. the total number of slots in use across both queues on the host is greater than 2 [3],
  2. any of those slots belong to the idle.q queue, and
  3. a job is eligible to run in profx_3day.q and requires no more slots than idle.q currently occupies,

then Grid Engine begins suspending jobs running on that host via idle.q, starting with the one with the shortest accumulated runtime. By default, Grid Engine suspends a task by sending it the SIGSTOP signal; the task can later resume execution by means of SIGCONT. This scheme does not evict the task from memory on the execution host and will not work properly for distributed parallel programs. It also precludes any possibility of "migrating" the evicted task(s) onto other available resources.
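
For reference, here is how I read the fields of the per-host clause above; treat this as a sketch of my understanding, to be checked against the slotwise preemption notes in queue_conf(5):

slots=2(idle.q:0:sr)
  slots=2   per-host threshold: act once more than 2 slots are in use
            across this queue and its subordinated queue(s) on the host
  idle.q    the subordinated queue whose jobs get suspended
  0         an ordering value, relevant when several queues are subordinated
  sr        suspend the subordinated job with the shortest run time first
            (lr would instead target the longest-running job)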

An Example

I set up two test queues on Strauss: 3day.q stands in for an owner queue, and idle.q is the all-access idle queue. Both queues are present on a single host and have two execution slots; the relevant queue settings are sketched after the listing below. Suppose one core is in use by Professor X and one by Magneto:

[frey@strauss ~]$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
     34 0.55500 x.qs       magneto      r     11/02/2011 15:01:34 idle.q@strauss.udel.edu            1        
     36 0.55500 a.qs       profx        r     11/02/2011 15:01:49 3day.q@strauss.udel.edu            1 
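
The test queues' pertinent settings would look something like this sketch of the expected qconf -sq output (not a captured listing; only the attributes that matter here are shown):

[frey@strauss ~]$ qconf -sq 3day.q | egrep '^qname|^slots|^subordinate'
qname                 3day.q
slots                 2
subordinate_list      slots=2(idle.q:0:sr)
[frey@strauss ~]$ qconf -sq idle.q | egrep '^qname|^slots|^subordinate'
qname                 idle.q
slots                 2
subordinate_list      NONE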

Magneto is hungry for CPU time, so he submits two additional jobs on the idle queue:

[magneto@strauss ~]$ qsub -q idle.q y.qs
Your job 37 ("y.qs") has been submitted
[magneto@strauss ~]$ qsub -q idle.q z.qs
Your job 38 ("z.qs") has been submitted
[magneto@strauss ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
3day.q@strauss.udel.edu        BIP   0/1/2          0.54     sol-sparc64   
     36 0.55500 a.qs       profx        r     11/02/2011 15:01:49     1    
---------------------------------------------------------------------------------
idle.q@strauss.udel.edu        BIP   0/1/2          0.54     sol-sparc64   P
     34 0.55500 x.qs       magneto      r     11/02/2011 15:01:34     1        

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     37 0.55500 y.qs       magneto      qw    11/02/2011 15:02:02     1        
     38 0.55500 z.qs       magneto      qw    11/02/2011 15:02:03     1 

The idle.q instance now shows state P: an overload condition exists. This state is produced by the subordinate_list clause added to the configuration for 3day.q; adding another job to the idle queue instance would exceed the threshold, so the pending jobs must wait.

Suddenly, Professor X finds that the input to one of his tasks was incorrect, and he must recalculate one figure for his paper. He submits a job:

[profx@strauss ~]$ qsub -q 3day.q b.qs
Your job 39 ("b.qs") has been submitted
[profx@strauss ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
3day.q@strauss.udel.edu        BIP   0/2/2          0.49     sol-sparc64   
     36 0.55500 a.qs       profx        r     11/02/2011 15:01:49     1        
     39 0.55500 b.qs       profx        r     11/02/2011 15:04:19     1    
---------------------------------------------------------------------------------
idle.q@strauss.udel.edu        BIP   0/0/2          0.49     sol-sparc64   P

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     37 0.55500 y.qs       magneto      qw    11/02/2011 15:02:02     1        
     38 0.55500 z.qs       magneto      qw    11/02/2011 15:02:03     1 

Ah! Magneto's x.qs job has been evicted from idle.q on the host. Since idle.q was reconfigured to send SIGKILL instead of SIGSTOP, the offending job was outright terminated to make room for the owner's work.
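
That reconfiguration can be done via the queue's suspend_method attribute, which accepts a signal name to deliver in place of the default SIGSTOP; a minimal sketch of the change:

[frey@strauss ~]$ qconf -mq idle.q
  :
suspend_method        SIGKILL
  :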

We fast-forward several hours, and Professor X's a.qs job has completed:

[frey@strauss ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
3day.q@strauss.udel.edu        BIP   0/1/2          0.48     sol-sparc64   
     39 0.55500 b.qs       profx        r     11/02/2011 15:04:19     1    
---------------------------------------------------------------------------------
idle.q@strauss.udel.edu        BIP   0/1/2          0.48     sol-sparc64   P
     37 0.55500 y.qs       magneto      r     11/02/2011 15:05:19     1        

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     38 0.55500 z.qs       magneto      qw    11/02/2011 15:02:03     1 

This has opened up a slot in idle.q, which the waiting y.qs job consumes. Once the other job owned by Professor X completes:

[frey@strauss ~]$ qstat -f
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
3day.q@strauss.udel.edu        BIP   0/0/2          0.46     sol-sparc64   
---------------------------------------------------------------------------------
idle.q@strauss.udel.edu        BIP   0/2/2          0.46     sol-sparc64   P
     37 0.55500 y.qs       magneto      r     11/02/2011 15:05:19     1        
     38 0.55500 z.qs       magneto      r     11/02/2011 15:06:04     1 

the idle queue can be fully utilized by Magneto.

Next Steps

It is not immediately clear what interplay will manifest between slot-based resource quotas and subordination. Likewise, the subordination threshold should be summed across all N owner queues present on a host, where in general N is greater than one. The behavior of 3-day, 1-day, and indeterminate-length job queues with idle subordination needs some careful testing.
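
As a concrete, purely hypothetical illustration of the multi-owner case: a host appearing in both a 3-day and a 1-day owner queue might carry a clause like the following in each queue, and whether the two thresholds are honored independently or effectively summed is one of the things to test:

# in profx_3day.q (hypothetical)
subordinate_list:    slots=NONE,[@profx.hosts=slots=2(idle.q:0:sr)]

# in profx_1day.q (hypothetical)
subordinate_list:    slots=NONE,[@profx.hosts=slots=2(idle.q:0:sr)]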

[1] The seq_no attribute in a queue's configuration.
[2] In time we hope to have process checkpointing implemented for proper suspension/migration rather than the outright killing of idle processes.
[3] Assuming Professor X has ancient two-processor, single-core nodes.