====== The job queues on Farber ======
  
Each investing-entity on a cluster has an //owner queue// that exclusively uses the investing-entity's compute nodes. (They do not use any nodes belonging to others.) Grid Engine allows those queues to be selected only by members of the investing-entity's group.
^   <<//investing_entity//>>''.q''  | The default queue for all jobs. |
^  ''standby.q''  | A special queue that spans all standard nodes, at most 200 slots per user.  Submissions will have a lower priority than jobs submitted to owner-queues, and standby jobs will only be started on lightly-loaded nodes.  These jobs will not be preempted by others' job submissions. Jobs will be terminated with notification after running for 8 hours of elapsed (wall-clock) time.  //Also see the ''standby-4h.q'' entry.//  |
^  ::: | You must specify **-l standby=1** as a **qsub** option. You must also use the **-notify** option if your job traps the USR2 termination signal. |
^  ''standby-4h.q''  | A special queue that spans all standard nodes, at most 800 slots per user.  Submissions will have a lower priority than jobs submitted to owner-queues, and 4hr standby jobs will only be started on lightly-loaded nodes.  These jobs will not be preempted by others' job submissions. Jobs will be terminated with notification after running for 4 hours of elapsed (wall-clock) time. |
^  ::: | You must specify **-l standby=1** as a **qsub** option. If more than 200 slots are requested, you must also specify a maximum run-time of 4 hours or less via the **-l h_rt=//hh:mm:ss//** option. Finally, use the **-notify** option if your job traps the USR2 termination signal. |
^  ''spillover.q''  | A special queue that spans all standard nodes and is used by Grid Engine to map jobs when requested resources are unavailable on standard nodes in owner queues, e.g., other standby or spillover jobs are using owner resources. |
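
For example, a minimal standby submission that also asks Grid Engine for the ''USR2'' warning signal (via ''-notify'', as noted in the table) might look like this (a sketch; ''myjob.qs'' is a placeholder script name):

   qsub -l standby=1 -notify myjob.qs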
  
  
===== Farber "standby" queues =====

Farber has two standby queues, ''standby.q'' and ''standby-4h.q'', that span all **"standard"** nodes on the cluster. A "standard" node has 20 cores and 64 GB of memory.
  
The "standby" queues span nodes of the cluster to allow you to use unused compute cycles on other groups' nodes in addition to those from your own group's nodes. All standby jobs have a run-time limit specific to a cluster in order to be fair to node owners.
  
==== Grid Engine resources governing these queues ====
The "standby" queues are assigned based on the number of slots and the maximum (wall-clock) //hard// run-time you specify. Typically each cluster defines a default //hard// run-time limit as well as a maximum number of slots allowed per user for standby jobs.

The difference between the two queues is tied to the number of slots and the maximum (wall-clock) //hard// run-time you specify.

    * If you specify a maximum run-time of 4 hours or less (e.g., ''h_rt=4:00:00''), your job will be assigned to ''standby-4h.q''.

    * If you do **not** specify a maximum run-time, **or** if you specify a run-time greater than 4 hours but not exceeding 8 hours, then you may request up to 200 slots for any job.  The job will be assigned to ''standby.q''.

The total number of concurrent slots used by your ''standby.q'' jobs may not exceed 200, and the total used in ''standby.q'' and ''standby-4h.q'' combined may not exceed 800.  These limits are per user; jobs that would exceed them will wait until enough of your running standby jobs finish.

For example, you could concurrently run 25 20-slot jobs (500 slots). This would leave 300 slots available for any other concurrent standby jobs you may submit.

Job script example:
<code bash>
#
# The standby flag asks to run the job in a standby queue.
#$ -l standby=1
#
# This job needs an openmpi parallel environment using 500 slots.
#$ -pe openmpi 500
#
# The h_rt flag specifies a 4-hr maximum (hard) run-time limit.
# The flag is required because the job needs more than 200 slots.
#$ -l h_rt=4:00:00
...
</code>
  
==== Mapping jobs to nodes ====
  
Once Grid Engine determines the appropriate standby queue, it maps the job to available, idle nodes (hosts) to fill all the slots. For Open MPI jobs, Grid Engine is configured to use the //fill up// allocation rule by default.  This keeps the number of nodes down and thus reduces the amount of inter-node communication.

It may be useful to control the number of nodes and the number of processes per node.
For example, with a submission such as:
   qsub -l standby=1,h_rt=4:00:00 myopenmpi.qs
the MPI processes (ranks) will be mapped to **''ceiling(NSLOTS/20)''** hosts.  All hosts, except possibly the last host, will have exactly 20 processes.
The MPI flag ''--display_map'' will display details of the allocation in your Grid Engine output.  The MPI flag ''--loadbalance'' will spread the processes out among the assigned hosts, without changing the number of hosts.
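
For instance, a hypothetical 50-slot standby job would be laid out like this under the //fill up// rule (a worked illustration, not output from an actual run):

   NSLOTS = 50  ->  ceiling(50/20) = 3 hosts
   host 1: 20 processes, host 2: 20 processes, host 3: 10 processes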

<note important>If you do not specify an exact multiple of 20 slots, there will be slots available to other queued Grid Engine jobs.
If these jobs are assigned to your nodes (you have them for up to 8 hours), they will compete for shared resources:
^ Resource ^ Shared per node ^
| cores | 20 |
| memory | 64 GB |
If you add the ''exclusive'' resource to your queue script file
   #$ -l exclusive=1
then Grid Engine will round up your slot request to a multiple of 20 and thus keep other jobs off the node.
</note>

<note tip>The allocation rule and the group names are configured in Grid Engine.  You can use the ''qconf'' command to show the current configuration.

To see the current allocation rule for ''mpi'':

   $ qconf -sp mpi | grep allocation_rule
   allocation_rule    $fill_up

To see a list of all group names:

   $ qconf -shgrpl
   @128G
   @256G
   ...

To see the nodes in a group name:
   $ qconf -shgrp @128G
   group_name @128G
   hostlist n068 n069 n070 n071 n106 n108 n109 n110 n113
</note>

==== Actions at the run-time limit ====
When a standby job reaches its maximum run time, Grid Engine kills the job. The process depends on your use of Grid Engine's ''-notify'' flag and how your job handles the resulting ''USR2'' notification signal.
  
User             = traine
Queue            = standby.q@n015
Host             = n015.farber.hpc.udel.edu
Start Time       = 06/01/2012 12:38:51
End Time         = 06/01/2012 16:43:53

==== What if my program does not catch USR2? ====
  
<note>The following information comes from a lengthy troubleshooting session with a Farber user.  IT thanks the user for his patience while all the details were hashed out.</note>
  
When a program that does not handle the ''USR2'' signal receives that signal, the default action is to abort.  This will raise a ''CHLD'' signal in the program's parent process.  Typically, a Grid Engine job script (the parent process) will react to ''CHLD'' by itself aborting.
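
One way to handle this is to have the job script itself trap ''USR2'' while the MPI program runs in the background, as explained below. A minimal sketch of that pattern (a reconstruction for illustration; the body of ''runtimeLimitReached'' and the program name ''./my_program'' are placeholders):

<code bash>
#$ -l standby=1
#$ -notify
#$ -pe openmpi 40

# Placeholder handler: save partial results, write a checkpoint marker, etc.
runtimeLimitReached() {
    echo "Run-time limit reached; cleaning up before Grid Engine kills the job."
}

# Ignore USR2 for now so that mpirun, started next, inherits that disposition.
trap '' USR2

# Start mpirun in the background; the job script keeps executing immediately.
mpirun -np ${NSLOTS} ./my_program &

# Have the job shell itself run the handler on USR2; the background mpirun
# still ignores the signal.
trap runtimeLimitReached USR2

# Sleep until mpirun exits; wait is a bash built-in and is interruptible by traps.
wait
</code>
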
The first ''trap'' command tells the job shell (and child processes) to ignore the ''USR2'' signal.  Appending the ampersand (&) character to the ''mpirun'' command runs it in the background and the job script continues to execute immediately.  Thus, ''mpirun'' has been started and will ignore ''USR2'' signals.  Now, the job shell is set to execute the ''runtimeLimitReached'' function when it itself receives ''USR2'' -- this //does not// change the behavior of the ''mpirun'' that is executing in the background, though (it will still ignore ''USR2'').  Finally, the ''wait'' command puts the job shell to sleep (no CPU usage) until all child processes have exited (namely, ''mpirun'').  The ''wait'' command is a function built-in to BASH and not an external program, so it does not cause the job shell to block and delay reaction to ''USR2'' like ''mpirun'' did in the previous example.
  
===== Farber Exclusive access =====

If a job is submitted with the ''-l exclusive=1'' resource, Grid Engine will:

  * promote any serial jobs to 20-core threaded (''-pe threads 20'')
  * modify any parallel jobs to round up the slot count to the nearest multiple of 20
  * ignore any memory resources and make all memory available on all nodes assigned to the job

A job running on a node with ''-l exclusive=1'' will block any other jobs from making use of resources on that host.
  
Job script example:
<code bash>
#
# The exclusive flag asks for exclusive use of every node needed to fulfill the requested slots.
#$ -l exclusive=1
#
# This job needs an openmpi parallel environment using 32 slots = 2 nodes exclusively.
#$ -pe openmpi 32
#
# By default the slot count granted by Grid Engine will be
# used, one MPI worker per slot.  Set this variable if you
# want to use fewer cores than Grid Engine granted you (e.g.
# when using exclusive=1):
#
#WANT_NPROC=0
...
</code>
  
<note tip>In the script example, this job would be rounded up to 40 slots and would be assigned 2 nodes. If you really want your job to run with only 32 slots, uncomment and set ''WANT_NPROC=32''.</note>
  
Grid Engine is configured to "fill up" nodes by allocating as many slots as possible on one node before proceeding to another node to fulfill the total number of slots requested for the job.  Unfortunately, Grid Engine may not do what you expect, which is to evenly distribute and fill up the total number of nodes needed for your job.  For example, if you submit four Open MPI jobs each requesting 20 slots and there are four free nodes each with 20 cores on the cluster, you would expect each job to be assigned to a single node; in fact, the first job may land on a single node, while the second, third, and fourth may wind up straddling the remaining three nodes.
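
One way to avoid having a job straddle nodes is to request whole nodes with the ''exclusive'' resource described above, for example (a sketch; ''myopenmpi.qs'' is a placeholder script name):

   qsub -l exclusive=1 -pe openmpi 20 myopenmpi.qs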
  