====== Managing Jobs on Farber ======
  
Once a user has been able to submit jobs to the queue -- interactive or batch -- the user will from time to time want to know what those jobs are doing.  Is the job waiting in a queue for resources to become available, or is it executing?  How long has the job been executing?  How much CPU time or memory has the job consumed?  Users can query Grid Engine for job information using the ''qstat'' command.  The ''qstat'' command has a variety of command line options available to customize and filter what information it displays; discussing all of them is beyond the scope of this document.  Please see the ''qstat'' man page for a detailed description of all options.
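For example, checking the status of your own jobs is as simple as typing ''qstat'' with no options (which, as described below, defaults to reporting on the invoking user's jobs):

<code bash>
[(it_css:traine)@farber it_css]$ qstat
</code>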
The ''qstat'' command also allows the user to see job status information for any other cluster user by means of the ''-u'' flag.  The flag requires a single argument:  a username or the wildcard character (''\*''):
<code bash>
[(it_css:traine)@farber it_css]$ qstat -u traine
   :
[(it_css:traine)@farber it_css]$ qstat -u \*
   :
</code>
In all forms discussed above the output from ''qstat'' focuses on jobs.  To instead view the status information in a host-centric format, the ''-f'' option should be added to the ''qstat'' command.  The output from ''qstat -f'' is organized by queue instances (thus, also by compute hosts) with jobs running in a particular queue instance summarized therein:
<code bash>
[(it_css:traine)@farber it_css]$ qstat -f -q 'it_css*'
queuename                      qtype resv/used/tot. load_avg arch          states
---------------------------------------------------------------------------------
   :
</code>
  
<code bash>
[(it_css:traine)@farber it_css]$ qhost -h n013 -h n014
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
   :
</code>
  
<code bash>
[(it_css:traine)@farber it_css]$ qstat -j 82518
==============================================================
job_number:                 82518
   :
sge_o_shell:                /bin/bash
sge_o_workdir:              /lustre/work/it_css
sge_o_host:                 farber
account:                    sge
cwd:                        /lustre/work/it_css
merge:                      y
hard resource_list:         idle_resources=0,dev_resources=0,exclusive=1,standby_resources=1,scratch_free=1000000
mail_list:                  traine@farber.hpc.udel.edu
notify:                     FALSE
job_name:                   mpibounce.qs
   :
</code>
  
<code bash>
[(it_css:traine)@farber ~]$ qjobs
===============================================================================
JobID  Owner              State    Submitted as
   :
</code>
  
<code bash>
[(it_css:traine)@farber ~]$ qjobs -g sandler_thermo
===============================================================================
JobID  Owner              State    Submitted as
   :
</code>
The ''qstatgrp'' command by default summarizes usage of all queues to which the user has access given his/her current working group.  Adding the ''-j'' flag summarizes the jobs executing in those queues rather than summarizing the queues themselves.
  
The ''qhostgrp'' command by default summarizes usage of all hosts to which the user has access given his/her current working group.  Adding the ''-j'' flag summarizes the jobs (including [[abstract:farber:runjobs:queues#farber-standby-queues|standby]]) executing on those hosts rather than summarizing the hosts themselves.
  
Both ''qstatgrp'' and ''qhostgrp'' accept a ''-g'' <<//group_name//>> option to limit output to an arbitrary group (and not just the user's current working group), as shown below.
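For example (using //it_css// as an illustrative investing-entity name):

<code bash>
# Summarize the owner-group queues for the it_css investing-entity:
qstatgrp -g it_css
# Summarize the jobs running on the it_css owner-group nodes:
qhostgrp -j -g it_css
</code>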

==== Resource-management options ====

Any large cluster has many nodes with potentially differing resources, e.g., cores, memory, disk space, and accelerators.  The resources you can request fall into three categories (an example request follows the list):

  - Fixed by the node's configuration, e.g., slots and installed memory
  - Set by load sensors, e.g., CPU load averages and current memory usage
  - Managed by the job scheduler's internal bookkeeping to ensure availability, e.g., available memory and floating software licenses

**Details by cluster**

   * [[abstract:farber:runjobs:schedule_jobs#resource-management-options-on-farber|Farber]]
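Scheduler-managed resources are requested with the ''-l'' option at submission time.  A minimal sketch, assuming the ''standby'' and ''exclusive'' resource names that appear elsewhere on this page and a placeholder job script ''myjob.qs''; see the Farber page above for the authoritative list of resource names:

<code bash>
# Request standby resources and exclusive access to the assigned nodes
# (myjob.qs is a placeholder for your own job script):
qsub -l standby=1,exclusive=1 myjob.qs
</code>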

===== Managing Jobs =====
==== Checking job status ====

Use the **qstat** command to check the status of queued jobs.  Use the ''qstat -help'' or ''man qstat'' commands on the login node to view a complete description of the available options.  Some of the most often-used options are summarized here:

^ Option ^ Result ^
| ''-j'' <<//job_id_list//>> | Displays information for the specified job(s) |
| ''-u'' <<//user_list//>> | Displays information for jobs associated with the specified user(s) |
| ''-ext'' | Displays extended information about jobs |
| ''-t'' | Displays additional information about subtasks |
| ''-r'' | Displays the resource requirements of jobs |

For example, to list the information for job 62900, type
<code>
qstat -j 62900
</code>

To list a table of jobs assigned to user //traine// that displays the resource requirements for each job, type
<code>
qstat -u traine -r
</code>

With no options, **qstat** defaults to ''qstat -u $USER'', so you get a table of your own jobs.  With the ''-u'' option, the **qstat** command uses the //Reduced Format//, which has the following columns.

^ Column header ^ Description ^
| ''job-ID'' | job ID assigned to the job |
| ''user'' | user who owns the job |
| ''name'' | job name |
| ''state'' | current job status, including **qw**(aiting), **s**(uspended), **r**(unning), **h**(old), **E**(rror), **d**(eletion) |
| ''submit/start at'' | submit time (waiting jobs) or start time (running jobs) |
| ''queue'' | name of the queue the job is assigned to (for running or suspended jobs only) |
| ''slots'' | number of slots assigned to the job |

=== A more concise listing ===

The IT-supplied **qjobs** command provides a more convenient listing of job status.

^ Command ^ Description ^
| ''qjobs'' | Displays the status of jobs submitted by you |
| ''qjobs -g'' | Displays the status of jobs submitted by your research group |
| ''qjobs -g'' <<//investing_entity//>> | Displays the status of jobs submitted by members of the named investing-entity |
| ''qjobs -a'' | Displays the status of jobs submitted by **a**ll users |

In all cases the JobID, Owner, State, and Name are listed in a table.

=== Job status is qw ===

When your job's status is ''qw'', it is queued and waiting to execute.  When you check with ''qstat'' you might see something like this:

<code bash>
[(it_css:traine)@farber it_css]$ qstat -u traine
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  99154 0.50661 openmpi-pg traine       qw    11/12/2012 14:33:49                                  144
</code>

Sometimes a job gets stuck in the ''qw'' state and never starts running.  You can use **qalter** to query the job scheduler about why your job is not running.  For example, to see the last 10 lines of the job scheduler's validation for job 99154, type

<code bash>
[(it_css:traine)@farber it_css]$ qalter -w p 99154 | tail -10
Job 99154 has no permission for cluster queue "puleo-qrsh.q"
Job 99154 has no permission for cluster queue "capsl.q+"
Job 99154 has no permission for cluster queue "spare.q"
Job 99154 has no permission for cluster queue "it_nss-qrsh.q"
Job 99154 has no permission for cluster queue "it_nss.q"
Job 99154 has no permission for cluster queue "it_nss.q+"
Job 99154 Jobs cannot run because only 72 of 144 requested slots are available
Job 99154 Jobs can not run in PE "openmpi" because the resource requirements can not be satified
verification: no suitable queues
</code>

In this example, we asked for 144 slots, but only 72 slots are available on the workgroup ''it_css'' nodes.

<code bash>
[(it_css:traine)@farber it_css]$ qstatgrp
CLUSTER QUEUE                   CQLOAD   USED    RES  AVAIL  TOTAL aoACDPS  cdsuE
it_css-dev.q                      0.00      0      0     72     72      0      0
it_css-qrsh.q                     0.00      0      0     72     72      0      0
it_css.q                          0.00      0      0     72     72      0      0
it_css.q+                         0.00      0      0     72     72      0      0
standby-4h.q                      0.27      0      0   4968   5064      0     96
standby.q                         0.27     12      0   4932   5064      0    120
</code>

Use **qalter** to change the attributes of the pending job, such as reducing the number of slots requested so the job fits on the workgroup ''it_css'' nodes, or changing the resources specified so the job can run in the [[:abstract:farber:runjobs:queues#farber-standby-queues|standby queue]].  For example, to change the number of slots requested from 144 to 48, type

<code bash>
[(it_css:traine)@farber it_css]$ qalter -pe openmpi 48 99154
modified parallel environment of job 99154
modified slot range of job 99154
[(it_css:traine)@farber it_css]$ qstat -u traine
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  99154 0.50661 openmpi-pg traine       qw    11/12/2012 14:33:49                                  48
</code>

Another way to get this job running would be to change its resources so that it runs in the standby queue.  To do this you must specify all resources, since ''qalter'' completely replaces any resource list previously specified for the job.  In this example, we alter the job to run in the standby queue by typing
<code bash>
[(it_css:traine)@farber it_css]$ qalter -l idle=0,standby=1 99154
modified hard resource list of job 99154
[(it_css:traine)@farber it_css]$ qstat -u traine
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  99154 0.50661 openmpi-pg traine       r     11/12/2012 15:23:52 standby.q@n016                   144
</code>

<note important>''qalter'' can only be used to alter jobs that you own!</note>

=== Job status is Eqw ===

When your job's status is ''Eqw'', an error occurred when Grid Engine attempted to schedule the job, and the job has been returned to the ''qw'' state.  When you check with ''qstat'' you might see something like this for user ''traine'':

<code bash>
[(it_css:traine)@farber it_css]$ qstat -u traine
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
 686924 0.50509 openmpi-pg traine       Eqw   08/12/2014 19:38:53                                                              1
</code>

If the state shows ''Eqw'', use ''qstat -j //job_id// | grep error'' to check for the error.  Here is an example of what you might see:

<code bash>
[traine@farber ~]$ qstat -j 686924 | grep error
error reason    1:          08/12/2014 22:08:27 [1208:60529]: error: can't chdir to /archive/it_css/traine/ex-openmpi: No such file or directory
</code>

This error indicates that a directory or file cannot be found.  Verify that the file or directory in question exists -- i.e., that you haven't forgotten to create it -- and that you can see it from both the head node and the compute nodes.  If it appears to be okay, the job may have suffered a transient condition such as a failed NFS automount, a temporarily unavailable NFS server, or some other filesystem error.
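A quick sanity check, using the directory from the error message above, is to list it from the head node:

<code bash>
# Reports "No such file or directory" if the path is indeed missing:
ls -ld /archive/it_css/traine/ex-openmpi
</code>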

If you understand the reason and have fixed the problem, use ''qmod -cj //job_id//'' to clear the error state, like this:

<code bash>
[traine@farber ~]$ qmod -cj 686924
</code>

and the job should eventually run.

==== Checking queue status ====

The **qstat** command can also be used to get the status of all queues on the system.

^ Option ^ Result ^
| ''-f'' | Displays summary information for all queues |
| ''-ne'' | Suppresses the display of empty queues |
| ''-qs'' {a%%|%%c%%|%%d%%|%%o%%|%%s%%|%%u%%|%%A%%|%%C%%|%%D%%|%%E%%|%%S} | Selects queues to be displayed according to state |

With the ''-f'' option, **qstat** uses the //full format//, which includes the following columns.

^ Column header ^ Description ^
| ''queuename'' | name of the queue instance |
| ''resv/used/total'' | number of slots reserved/used/total |
| ''states'' | current queue status, including **a**(larm), **s**(uspended), **d**(isabled), **h**(old),\\ **E**(rror), **P**(reempted) |

Examples:

List all queues that are unavailable because they are disabled or the slotwise preemption limits have been reached.
<code bash>
qstat -f -qs dP
</code>

List the queues associated with the investing entity //it_css//.
<code bash>
qstat -f | egrep '(queuename|it_css)'
</code>

==== Checking overall queue and node information ====

You can determine overall queue and node information using the ''qstatgrp'', ''qconf'', ''qnodes'' and ''qhostgrp'' commands.  Use a command's ''-h'' option to see its command syntax.  To obtain information about a group other than your current group, use the ''-g'' option.

^ Command ^ Illustrative example ^
| ''qstatgrp''  | ''qstatgrp'' shows a summary of the status of the owner-group queues of your current workgroup. |
| ''qstatgrp -j'' | ''qstatgrp -j'' shows the status of each job in the owner-group queues that members\\ of your current workgroup submitted. |
| ''qstatgrp -g'' <<//investing_entity//>>  | ''qstatgrp -g it_css'' shows the status of all the owner-group queues for the\\ //it_css// investing-entity. |
| ''qstatgrp -j -g'' <<//investing_entity//>>  | ''qstatgrp -j -g it_css'' shows the status of each job in the owner-group queues that\\ members of the //it_css// investing-entity submitted. |
| ''qconf -sql''  | **S**hows all **q**ueues as a **l**ist. |
| ''qconf -sq'' <<//queue_name//>> | ''qconf -sq it_css*'' displays the configuration of each owner-group queue for the\\ //it_css// investing-entity. |
| ''qnodes''  | ''qnodes'' displays the names of your owner-group's nodes. |
| ''qnodes -g'' <<//investing_entity//>>  | ''qnodes -g it_css'' displays the names of the nodes owned by the\\ //it_css// investing-entity. |
| ''qhostgrp''  | ''qhostgrp'' displays the current status of your owner-group's nodes. |
| ''qhostgrp -g'' <<//investing_entity//>>  | ''qhostgrp -g it_css'' displays the current status of the nodes owned by the\\ //it_css// investing-entity. |
| ''qhostgrp -j -g'' <<//investing_entity//>>  | ''qhostgrp -j -g it_css'' shows all jobs running (including [[:abstract:farber:runjobs:queues#farber-standby-queues|standby]] and spillover) on the owner-group nodes for the //it_css// investing-entity. |
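For example, a quick way to survey the queues and nodes available to your current workgroup:

<code bash>
qconf -sql    # show all queues as a list
qnodes        # display the names of your owner-group's nodes
qhostgrp      # display the current status of those nodes
</code>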

==== Checking overall usage of resource quotas ====

Resource quotas are used to help control the standby and spillover queues.  Each user has a quota based on the limits set by the [[:abstract:farber:runjobs:queues#farber-standby-queues|standby]] queue specifications for each cluster, and each workgroup has a per_workgroup quota based on the number of slots purchased by the research group.

^ Command ^ Illustrative example ^
| ''qquota -u'' <<//username//>> ''%%|%% grep standby''  | ''qquota -u traine %%|%% grep standby'' displays the current usage of slots by user\\ //traine// in the standby resources. |
| ''qquota -u \* %%|%% grep'' <<//investing_entity//>>  | ''qquota -u \* %%|%% grep it_css'' displays the current usage of slots being used by all\\ members of the //it_css// investing-entity, the per_workgroup quota. |

The example below gives a snapshot of the slots being used by user ''traine'' in the standby queues and the slots being used by all members of the workgroup ''it_css'':

<code>
$ qquota -u traine | grep standby
standby_limits/4h  slots=80/800         users traine queues standby-4h.q
standby_cumulative/default slots=80/800         users traine queues standby.q,standby-4h.q
$ qquota -u \* | grep it_css
per_workgroup/it_css slots=141/200        users @it_css queues it_css.q,spillover.q
</code>

<note important>If there are no jobs running as part of your workgroup, then your per_workgroup quota (of 0 out of N slots) is not displayed at all.</note>

==== Deleting a job ====

Use the **qdel** <<//job_id//>> command to remove pending and running jobs from the queue.

For example, to delete job 28000:
<code bash>
qdel 28000
</code>

<note important>**Your job is not deleted**

If a job remains in a deletion state even after you try to delete it with the
**qdel** command, try a force deletion with
<code bash>
qdel -f 28000
</code>
This simply forgets the job without attempting any cleanup on the node(s) being used.

</note>