===== The job queues on Mills =====

Each investing-entity on a cluster has four //owner queues// that exclusively use the investing-entity's compute nodes. (They do not use any nodes belonging to others.) Grid Engine allows those queues to be selected only by members of the investing-entity's group.

There are also node-wise queues, //standby//, //standby-4h//, //spillover-24core//, //spillover-48core// and //idle//, through which Grid Engine allows users to use nodes belonging to other investing-entities. (The idle queue is currently disabled.)

When submitting a batch job to Grid Engine, you specify the resources you need or want for your job. **//You don't actually specify the name of the queue.//** Instead, you include a set of directives that specify your job's characteristics. Grid Engine then chooses the most appropriate queue that meets those needs.

The queue to which a job is assigned depends primarily on six factors:

  * Whether the job is serial or parallel
  * Which parallel environment (e.g., openmpi, threads) is needed
  * Which or how much of a resource is needed (e.g., max clock time, max memory)
  * Whether the job can be suspended and restarted by the system
  * Whether the job is non-interactive or interactive
  * Whether you want to use idle nodes belonging to others
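
As an illustration, these characteristics are expressed as ''#$'' directives inside the job script rather than as a queue name. A minimal sketch (the job name and program are placeholders; the ''openmpi'' parallel environment is the one mentioned above):

```shell
#$ -N example_job        # job name (placeholder)
#$ -pe openmpi 24        # parallel environment and slot count
#$ -l h_rt=12:00:00      # maximum elapsed (wall-clock) time
#$ -m eas                # optional: email on end, abort, or suspend

# $NSLOTS is set by Grid Engine from the -pe request above
mpirun -np $NSLOTS ./my_program
```

Grid Engine reads these directives at submission time and assigns the job to a queue that satisfies them.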

For each investing-entity, the **owner-queue** names start with the investing-entity's name:

^   <<//investing_entity//>>''.q+''  | The default queue for non-interactive serial or parallel jobs. The primary queue for long-running jobs. These jobs must be able to be suspended and restarted by Grid Engine. They can be preempted by jobs submitted to the //development// queue. Examples: all serial (single-core) jobs, openMPI jobs, openMP jobs or other jobs using the threads parallel environment. |
^  <<//investing_entity//>>''.q''  | A special queue for __non-suspendable__ parallel jobs, such as MPICH. These jobs will not be preempted by others' job submissions. |
^  <<//investing_entity//>>''-qrsh.q''  | A special queue for interactive jobs only. Jobs are scheduled to this queue when you use Grid Engine's **qlogin** command. |
^  ''standby.q''  | A special queue that spans all nodes, at most 240 slots per user.  Submissions will have a lower priority than jobs submitted to owner-queues, and standby jobs will only be started on lightly-loaded nodes.  These jobs will not be preempted by others' job submissions. Jobs will be terminated with notification after running for 8 hours of elapsed (wall-clock) time.  //Also see the ''standby-4h.q'' entry.//  |
^  ::: | You must specify **-l standby=1** as a **qsub** option. You must also use the **-notify** option if your job traps the USR2 termination signal. |
^  ''standby-4h.q''  | A special queue that spans all nodes, at most 816 slots per user.  Submissions will have a lower priority than jobs submitted to owner-queues, and standby jobs will only be started on lightly-loaded nodes.  These jobs will not be preempted by others' job submissions. Jobs will be terminated with notification after running for 4 hours of elapsed (wall-clock) time. |
^  ::: | You must specify **-l standby=1** as a **qsub** option. And, if more than 240 slots are requested, you must also specify a maximum run-time of 4 hours or less via the **-l h_rt=//hh:mm:ss//** option. Finally, use the **-notify** option if your job traps the USR2 termination signal. |
^  ''spillover-24core.q''  | A special queue that spans all standard nodes (24 cores) and is used by Grid Engine to map jobs when requested resources are unavailable on standard nodes in owner queues, e.g., node failure or other standby jobs are using owner resources. **Implemented on February 29, 2016** according to the [[https://sites.udel.edu/research-computing/files/2016/01/MillsEnd-of-LifePlanandPolicies-3-1jp8lqd.pdf|Mills End-of-Life Policy]]. |
^  ''spillover-48core.q''  | A special queue that spans all 4-socket nodes (48 cores) and is used by Grid Engine to map jobs when requested resources are unavailable on 48-core nodes in owner queues, e.g., node failure or other standby jobs are using owner resources. Owners of only 48-core nodes will not spill over to standard nodes. **Implemented on February 29, 2016** according to the [[https://sites.udel.edu/research-computing/files/2016/01/MillsEnd-of-LifePlanandPolicies-3-1jp8lqd.pdf|Mills End-of-Life Policy]]. |
^  ''spare.q''  | A special queue that spans all nodes kept in reserve as replacements for failed owner-nodes. Temporary access to the spare nodes will be granted by request. When access is granted, the spare nodes will augment your owner nodes.  Jobs on the spare nodes will not be preempted by others' job submissions, but may need to be killed by IT. The owner of a job running on a spare node will be notified by email two hours before IT kills the job. |
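
For example, submissions to the standby queues might look like the following sketch (''myjob.qsub'' is a placeholder script; the options are those listed in the table above):

```shell
# Standby job (8-hour limit); -notify is needed if the script traps USR2
qsub -l standby=1 -notify myjob.qsub

# standby-4h job requesting more than 240 slots: a maximum run time
# of 4 hours or less must be given explicitly with -l h_rt
qsub -l standby=1 -l h_rt=4:00:00 -pe openmpi 480 -notify myjob.qsub
```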

<note tip>
Be considerate in your use of the development queue. It may preempt '**q+**' jobs being run by other users in your group if those jobs' computational resources are needed.
</note>

===== Spare nodes =====
  
<note>If jobs running on spare nodes do need to be killed, IT will provide two hours notice via email to the jobs' owners.</note>
  
==== Requesting access ====
  
  • abstract/mills/runjobs/queues.1526990373.txt.gz
  • Last modified: 2018-05-22 07:59
  • by sraskar