
The job queues on Farber

Each investing-entity on a cluster has an owner queue that exclusively uses the investing-entity's compute nodes. (Owner queues do not use any nodes belonging to other investing-entities.) Grid Engine allows those queues to be selected only by members of the investing-entity's group.

There are also cluster-wide queues: standby, standby-4h and spillover. Through these queues, Grid Engine allows users to use standard nodes belonging to other investing-entities.

When submitting a batch job to Grid Engine, you specify the resources you need or want for your job. You don't typically specify the name of the queue. Instead, you include a set of directives that specify your job's characteristics. Grid Engine then chooses the most appropriate queue that meets those needs.

The queue to which a job is assigned depends primarily on five factors:

  • Whether the job is serial or parallel
  • Which parallel environment (e.g., mpi, threads) is needed
  • Which or how much of a resource is needed (e.g., max clock time, memory requirements)
  • Resources your job will consume (e.g. an entire node, max memory usage)
  • Whether the job is non-interactive or interactive

For each investing-entity, the owner-queue names start with the investing-entity's name:

«investing_entity».q The default queue for all jobs.
standby.q A special queue that spans all standard nodes, with at most 200 slots per user. Submissions have a lower priority than jobs submitted to owner queues, and standby jobs will only be started on lightly loaded nodes. These jobs will not be preempted by others' job submissions. Jobs will be terminated with notification after running for 8 hours of elapsed (wall-clock) time. Also see the standby-4h.q entry.
You must specify -l standby=1 as a qsub option. You must also use the -notify option if your job traps the USR2 termination signal.
standby-4h.q A special queue that spans all standard nodes, with at most 800 slots per user. Submissions have a lower priority than jobs submitted to owner queues, and 4-hour standby jobs will only be started on lightly loaded nodes. These jobs will not be preempted by others' job submissions. Jobs will be terminated with notification after running for 4 hours of elapsed (wall-clock) time.
You must specify -l standby=1 as a qsub option. If more than 200 slots are requested, you must also specify a maximum run-time of 4 hours or less via the -l h_rt=hh:mm:ss option. Finally, use the -notify option if your job traps the USR2 termination signal.
spillover.q A special queue that spans all standard nodes. Grid Engine uses it to place jobs when the requested resources are unavailable on standard nodes in the owner queues, e.g., when other standby or spillover jobs are using owner resources.
In addition to their required run-time limits, the standby queues differ in how many slots each user may use concurrently: the queue with the shorter time limit allows more concurrent slots.

Farber has two standby queues, standby.q and standby-4h.q, that span all “standard” nodes on the cluster. A “standard” node has 20 cores and 64GB of memory.

The “standby” queues span nodes of the cluster to allow you to use unused compute cycles on other groups' nodes in addition to those from your own group's nodes. All standby jobs have a run-time limit specific to a cluster in order to be fair to node owners.

Grid Engine preferentially allocates standby slots on nodes that are lightly loaded. It assigns these jobs a lower queue-priority than jobs submitted by members of the group owning the node. Consequently, normal jobs will tend to push ahead of standby jobs and generally execute before them. Once your job is running, it will not be killed or suspended before the specified run-time limit has been reached.

Specify the standby resource in the qsub command or in a job script to target the standby queues. For example,

qsub -l standby=1 ...

The “standby” queues are assigned based on the number of slots and the maximum (wall-clock) hard run-time you specify. Typically each cluster defines a default hard run-time limit as well as a maximum number of slots allowed per user for standby jobs.

The difference between the two queues is tied to the number of slots and the maximum (wall-clock) hard run-time you specify.

  • If you specify a maximum run-time of 4 hours or less (e.g., h_rt=4:00:00), your job will be assigned to standby-4h.q.
  • If you do not specify a maximum run-time or if you specify a run-time greater than 4 hours but not exceeding 8 hours, then you may request up to 200 slots for any job. The job will be assigned to standby.q.

The total number of slots used by your concurrent standby.q jobs may not exceed 200, and the total across both standby.q and standby-4h.q may not exceed 800. These limits are per user; any further standby jobs you submit will wait for your running jobs to finish.

For example, you could concurrently run 25 20-slot jobs (500 slots). This would leave 300 slots available for any other concurrent standby jobs you may submit.
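The arithmetic behind such a plan is simple enough to check with a few lines of shell. This sketch uses the illustrative numbers above (25 twenty-slot jobs) against the 800-slot combined per-user cap:

```shell
# Hypothetical plan: 25 concurrent 20-slot standby jobs.
jobs=25
slots_per_job=20

used=$(( jobs * slots_per_job ))   # slots in use
remaining=$(( 800 - used ))        # headroom under the 800-slot combined cap

echo "using $used standby slots; $remaining remain available"
```

Note that jobs using more than 200 slots in total must go to standby-4h.q, since standby.q is capped at 200 slots per user.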

Job script example:

#
# The standby flag asks to run the job in a standby queue.
#$ -l standby=1
#
# This job needs an openmpi parallel environment using 500 slots.
#$ -pe openmpi 500
#
# The h_rt flag specifies a 4-hr maximum (hard) run-time limit.
# The flag is required because the job needs more than 200 slots.
#$ -l h_rt=4:00:00
...

Once Grid Engine determines the appropriate standby queue, it maps the job onto available, idle nodes (hosts) to fill all the slots. For openmpi jobs, Grid Engine is configured to use the fill_up allocation rule by default. This keeps the number of nodes down and thus reduces the amount of inter-node communication.

It may be useful to control the number of nodes and the number of processes per node. For example:

 qsub -l standby,h_rt=4:00:00 myopenmpi.qs

The MPI processes (ranks) will be mapped to ceiling(NSLOTS/20) hosts. All hosts, except possibly the last host, will have exactly 20 processes. The MPI_FLAG -display_map will display details of the allocations in your Grid Engine output. The MPI_FLAG -loadbalance will spread the processes out among the assigned hosts, without changing the number of hosts.
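The host count above can be reproduced with a little shell arithmetic; in a running job Grid Engine exports NSLOTS, so the value here is faked purely for illustration:

```shell
# In a running job Grid Engine sets NSLOTS; faked here for illustration.
NSLOTS=500
cores_per_node=20

# ceiling(NSLOTS/20): add (divisor - 1) before the integer division
hosts=$(( (NSLOTS + cores_per_node - 1) / cores_per_node ))

# processes on the last host (a full 20 when NSLOTS is an exact multiple)
last=$(( NSLOTS % cores_per_node ))
[ "$last" -eq 0 ] && last=$cores_per_node

echo "$NSLOTS slots -> $hosts hosts, $last processes on the last host"
```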

If you do not specify an exact multiple of 20, there will be slots available to other queued Grid Engine jobs. If those jobs are assigned to your nodes (you hold the nodes for up to 8 hours), they will compete with yours for the shared resources of a standard node:

  Resource   Shared
  cores      20
  memory     64 GB

If you add the exclusive resource in your job script file

 #$ -l exclusive=1

then Grid Engine will round up your slot request to a multiple of 20 and thus keep other jobs off the node.
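The round-up that exclusive=1 performs can be sketched with the same integer arithmetic, assuming the standard 20 cores per node:

```shell
slots=32             # as requested with, e.g., -pe openmpi 32
cores_per_node=20

# exclusive=1 rounds the request up to the next multiple of 20
granted=$(( ((slots + cores_per_node - 1) / cores_per_node) * cores_per_node ))

echo "requested $slots slots, granted $granted ($((granted / cores_per_node)) whole nodes)"
```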

The allocation rule and the group names are configured in Grid Engine. You can use the qconf command to show the current configuration.

To see the current allocation rule for mpi:

 $ qconf -sp mpi | grep allocation_rule
 allocation_rule    $fill_up

To see a list of all group names:

 $ qconf -shgrpl
 @128G
 @256G
 ... 

To see the nodes in a group name:

 $ qconf -shgrp @128G
 group_name @128G
 hostlist n068 n069 n070 n071 n106 n108 n109 n110 n113

When a standby job reaches its maximum run time, Grid Engine kills the job. The process depends on your use of Grid Engine's -notify flag and how your job handles the resulting USR2 notification signal.

  • Deferred termination: If you include the qsub -notify flag and the job catches the USR2 signal, your job gets an additional 5 minutes of run time during which it can checkpoint itself, log job progress, etc.
  • Immediate termination: If your job doesn't include the -notify flag or does not catch the USR2 signal, your job will terminate immediately.
Jobs that can react to kill-notification (the USR2 signal) should supply the -notify flag to qsub.

This bash script example catches USR2 and attempts to copy the contents of the job's scratch directory for later use when restarting the job:

#
# The -notify flag to qsub asks that we get a USR2 signal before being killed:
#$ -notify
#
# The -l standby=1 flag asks to run in a standby queue.
#$ -l standby=1
#
#$ -m eas
#$ -M traine@udel.edu
#
 
function copyScratchToLocal {
  echo -n `date`
  echo "  Copying $TMPDIR to "`pwd`
  cp -R "$TMPDIR" .
}
 
trap copyScratchToLocal SIGUSR2
 
date
ls /etc > "$TMPDIR/ls-etc"
while true; do
  sleep 10000
done

Submitting this job produces the following output:

Fri Jun  1 12:38:51 EDT 2012
User defined signal 2
Fri Jun 1 16:38:53 EDT 2012  Copying /scratch/39866.1.standby.q to /lustre/work/it_css/projects/standby

At two seconds past the max run time (4 hours in this case), the USR2 signal was delivered and the copyScratchToLocal shell function was executed. The job report that Grid Engine emailed when it finally killed the job shows the total run time:

Job 39866.1 (basic.qs) Aborted
Exit Status      = 137
Signal           = KILL
User             = traine
Queue            = standby.q@n015
Host             = n015.farber.hpc.udel.edu
Start Time       = 06/01/2012 12:38:51
End Time         = 06/01/2012 16:43:53
CPU              = 00:00:00
Max vmem         = 17.055M
failed assumedly after job because:
job 39866.1 died through signal KILL (9)

The job ran 5 minutes beyond the 4-hour mark, demonstrating that Grid Engine will indeed wait a full 5 minutes before delivering the KILL signal to end the job.

The default reaction of a program to the USR2 signal is to terminate. So if your job script does NOT catch the USR2 signal, it will terminate immediately and take the rest of the job with it!
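This behavior is easy to demonstrate outside of Grid Engine. In this sketch, sleep stands in for a job payload that installs no USR2 handler; a process killed by a signal exits with status 128 plus the signal's number:

```shell
# Start a child that installs no USR2 handler; sleep stands in for the payload.
sleep 30 &
pid=$!

# Simulate Grid Engine's notification signal.
kill -USR2 "$pid"
wait "$pid"
status=$?

signum=$(kill -l USR2)   # the signal number for USR2 (12 on Linux)
echo "child exited with status $status = 128 + $signum"
```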
The following information comes from a lengthy troubleshooting session with a Farber user. IT thanks the user for his patience while all the details were hashed out.

When a program that does not handle the USR2 signal receives that signal, the default action is to abort. This will raise a CHLD signal in the program's parent process. Typically, a Grid Engine job script (the parent process) will react to CHLD by itself aborting.

Consider an MPI program that does not include the additional code necessary to catch the USR2 signal and gracefully shut itself down (or at least ignore the signal). The job script may look something like this:

  :
function runtimeLimitReached {
  # Do some cleanup, job archiving, whatever, then exit
  # to end the job
 
  exit 0
}
  :
trap runtimeLimitReached SIGUSR2
mpirun {arguments to mpirun...}
 
# Do any cleanup that should be done if the job finished prior
# to the runtime limit...

What happens when the runtime limit is reached?

  1. The job script and its child mpirun process receive USR2
    • The job script is blocked waiting on the mpirun process, so processing of USR2 is delayed until mpirun has exited
  2. Meanwhile, the mpirun process forwards the USR2 signal to all worker processes
  3. The worker processes do NOT handle USR2 so they abort upon receiving USR2
  4. The mpirun process sees its workers aborting, so it logs some error messages and aborts itself
  5. The job script sees the CHLD signal thrown by mpirun and reacts to it by exiting immediately

So the job script never gets a chance to react to the USR2 signal because the CHLD signal coming from mpirun superseded it.

If the job script ignores CHLD then the mpirun would inherit that behavior, ignore all of its children when they abort due to their receiving USR2, and never return control to the job script! This means the job script would still never get a chance to react to USR2.

Without altering the MPI program to handle USR2, there is a way to allow the job script to react to USR2 while the MPI program continues to execute. The job script must be written such that:

  1. The job shell has a USR2 signal handler set
  2. The mpirun process ignores USR2 (and thus does not propagate it to the worker processes)
  3. The mpirun process is executed in such a way as to not block

The sequencing of the commands is the important part:

  :
function runtimeLimitReached {
  # Do some cleanup, job archiving, whatever, then exit
  # to end the job
 
  exit 0
}
  :
trap "" SIGUSR2
mpirun {arguments to mpirun...} &
trap runtimeLimitReached SIGUSR2
wait
 
# Do any cleanup that should be done if the job finished prior
# to the runtime limit...

The first trap command tells the job shell (and any child processes started afterward) to ignore the USR2 signal. Appending the ampersand (&) character to the mpirun command runs it in the background, so the job script continues executing immediately; mpirun has now been started and will ignore USR2 signals. The second trap then sets the job shell to execute the runtimeLimitReached function when it itself receives USR2. This does not change the behavior of the mpirun already running in the background: it will still ignore USR2.

Finally, the wait command puts the job shell to sleep (no CPU usage) until all child processes have exited, namely mpirun. Because wait is a bash builtin rather than an external program, it does not cause the job shell to block and delay its reaction to USR2 the way the foreground mpirun did in the previous example.
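The pattern above can be tested without MPI at all. In this sketch, sleep stands in for mpirun and a helper subshell stands in for Grid Engine delivering USR2; the handler sets a flag instead of exiting so the effect is easy to observe:

```shell
caught=0

function runtimeLimitReached {
  # a real job would checkpoint / archive results here
  caught=1
  kill "$mpirun_pid" 2>/dev/null   # stop the stand-in mpirun
}

trap "" SIGUSR2            # children started from here on ignore USR2
sleep 30 &                 # stand-in for mpirun; inherits the ignore
mpirun_pid=$!
trap runtimeLimitReached SIGUSR2

( sleep 1; kill -USR2 $$ ) &   # stand-in for Grid Engine's notification

wait                       # interrupted when USR2 arrives; the handler runs
echo "caught=$caught"
```

Because the stand-in mpirun inherited the ignore disposition, only the job shell reacts to the signal, and wait returns as soon as the trap fires.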

If a job is submitted with the -l exclusive=1 resource, Grid Engine will

  • promote any serial jobs to 20-core threaded (-pe threads 20)
  • modify any parallel jobs to round-up the slot count to the nearest multiple of 20
  • ignore any memory resources and make all memory available on all nodes assigned to the job

A job running on a node with -l exclusive=1 will block any other jobs from making use of resources on that host.

Job script example:

#
# The exclusive flag asks to run this job only on all nodes required to fulfill requested slots
#$ -l exclusive=1
#
# This job needs an openmpi parallel environment using 32 slots = 2 nodes exclusively.
#$ -pe openmpi 32
#
# By default the slot count granted by Grid Engine will be
# used, one MPI worker per slot.  Set this variable if you
# want to use fewer cores than Grid Engine granted you (e.g.
# when using exclusive=1):
#
#WANT_NPROC=0
 
...
In the script example, the 32-slot request would be rounded up to 40 slots and the job assigned 2 whole nodes. If you really want your job to run with only 32 slots, uncomment and set WANT_NPROC=32.

Grid Engine is configured to “fill up” nodes by allocating as many slots as possible on a node before proceeding to another node to fulfill the total number of requested slots. Unfortunately, this may not do what you expect: slots are not evenly distributed across the nodes your job needs. For example, if you submit four Open MPI jobs each requesting 20 slots and there are four free nodes each with 20 cores on the cluster, you might expect each job to be assigned a single node. In fact, the first job may land on a single node while the second, third, and fourth wind up straddling the remaining three nodes.

To assure that your job will be the only job running on a node (or all nodes needed to satisfy the slots requested), specify the exclusive resource in the qsub or qlogin command, or in a job script. For example,

qsub -l exclusive=1 ...

  • abstract/farber/runjobs/queues.txt
  • Last modified: 2018-10-08 16:01
  • by anita