abstract:farber:runjobs:job_status [2018-05-21 23:37] – sraskar | abstract:farber:runjobs:job_status [2021-10-13 16:49] (current) – [Per-Group QStat and QHost] anita |
---|
====== Viewing Job Status Information ====== | ====== Managing Jobs on Farber ====== |
| |
Once a user has been able to submit jobs to the queue -- interactive or batch -- the user will from time to time want to know what those jobs are doing. Is the job waiting in a queue for resources to become available, or is it executing? How long has the job been executing? How much CPU time or memory has the job consumed? Users can query Grid Engine for job information using the ''qstat'' command. The ''qstat'' command has a variety of command line options available to customize and filter what information it displays; discussing all of them is beyond the scope of this document. Please see the ''qstat'' man page for a detailed description of all options. | Once a user has been able to submit jobs to the queue -- interactive or batch -- the user will from time to time want to know what those jobs are doing. Is the job waiting in a queue for resources to become available, or is it executing? How long has the job been executing? How much CPU time or memory has the job consumed? Users can query Grid Engine for job information using the ''qstat'' command. The ''qstat'' command has a variety of command line options available to customize and filter what information it displays; discussing all of them is beyond the scope of this document. Please see the ''qstat'' man page for a detailed description of all options. |
The ''qstat'' command also allows the user to see job status information for any other cluster user by means of the ''-u'' flag. The flag requires a single argument: a username or the wildcard character (''\*''): | The ''qstat'' command also allows the user to see job status information for any other cluster user by means of the ''-u'' flag. The flag requires a single argument: a username or the wildcard character (''\*''): |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qstat -u traine | [(it_css:traine)@farber it_css]$ qstat -u traine |
: | : |
[(it_css:traine)@mills it_css]$ qstat -u \* | [(it_css:traine)@farber it_css]$ qstat -u \* |
: | : |
</code> | </code> |
In all forms discussed above the output from ''qstat'' focuses on jobs. To instead view the status information in a host-centric format, the ''-f'' option should be added to the ''qstat'' command. The output from ''qstat -f'' is organized by queue instances (thus, also by compute hosts) with jobs running in a particular queue instance summarized therein: | In all forms discussed above the output from ''qstat'' focuses on jobs. To instead view the status information in a host-centric format, the ''-f'' option should be added to the ''qstat'' command. The output from ''qstat -f'' is organized by queue instances (thus, also by compute hosts) with jobs running in a particular queue instance summarized therein: |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qstat -f -q 'it_css*' | [(it_css:traine)@farber it_css]$ qstat -f -q 'it_css*' |
queuename qtype resv/used/tot. load_avg arch states | queuename qtype resv/used/tot. load_avg arch states |
--------------------------------------------------------------------------------- | --------------------------------------------------------------------------------- |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qhost -h n013 -h n014 | [(it_css:traine)@farber it_css]$ qhost -h n013 -h n014 |
HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS | HOSTNAME ARCH NCPU LOAD MEMTOT MEMUSE SWAPTO SWAPUS |
------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qstat -j 82518 | [(it_css:traine)@farber it_css]$ qstat -j 82518 |
============================================================== | ============================================================== |
job_number: 82518 | job_number: 82518 |
sge_o_shell: /bin/bash | sge_o_shell: /bin/bash |
sge_o_workdir: /lustre/work/it_css | sge_o_workdir: /lustre/work/it_css |
sge_o_host: mills | sge_o_host: farber |
account: sge | account: sge |
cwd: /lustre/work/it_css | cwd: /lustre/work/it_css |
merge: y | merge: y |
hard resource_list: idle_resources=0,dev_resources=0,exclusive=1,standby_resources=1,scratch_free=1000000 | hard resource_list: idle_resources=0,dev_resources=0,exclusive=1,standby_resources=1,scratch_free=1000000 |
mail_list: traine@mills.hpc.udel.edu | mail_list: traine@farber.hpc.udel.edu |
notify: FALSE | notify: FALSE |
job_name: mpibounce.qs | job_name: mpibounce.qs |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills ~]$ qjobs | [(it_css:traine)@farber ~]$ qjobs |
=============================================================================== | =============================================================================== |
JobID Owner State Submitted as | JobID Owner State Submitted as |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills ~]$ qjobs -g sandler_thermo | [(it_css:traine)@farber ~]$ qjobs -g sandler_thermo |
=============================================================================== | =============================================================================== |
JobID Owner State Submitted as | JobID Owner State Submitted as |
The ''qstatgrp'' command by default summarizes usage of all queues to which the user has access given his/her current working group. Adding the ''-j'' flag summarizes the jobs executing in those queues rather than summarizing the queues themselves. | The ''qstatgrp'' command by default summarizes usage of all queues to which the user has access given his/her current working group. Adding the ''-j'' flag summarizes the jobs executing in those queues rather than summarizing the queues themselves. |
| |
The ''qhostgrp'' command by default summarizes usage of all hosts to which the user has access given his/her current working group. Adding the ''-j'' flag summarizes the jobs (including [[general/jobsched/standby|standby]]) executing on those hosts rather than summarizing the hosts themselves. | The ''qhostgrp'' command by default summarizes usage of all hosts to which the user has access given his/her current working group. Adding the ''-j'' flag summarizes the jobs (including [[abstract:farber:runjobs:queues#farber-standby-queues|standby]]) executing on those hosts rather than summarizing the hosts themselves. |
| |
Both ''qstatgrp'' and ''qhostgrp'' accept a ''-g ''<<''group name''>> option that limits the output to an arbitrary group (and not just the user's current working group). | Both ''qstatgrp'' and ''qhostgrp'' accept a ''-g ''<<''group name''>> option that limits the output to an arbitrary group (and not just the user's current working group). |
**Details by cluster** | **Details by cluster** |
| |
* [[clusters:mills:runapps#resource-management-options|Mills]] | * [[abstract:farber:runjobs:schedule_jobs#resource-management-options-on-farber|Farber]] |
* [[clusters:farber:runapps#resource-management-options|Farber]] | |
| |
===== Managing Jobs ===== | ===== Managing Jobs ===== |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qstat -u traine | [(it_css:traine)@farber it_css]$ qstat -u traine |
job-ID prior name user state submit/start at queue slots ja-task-ID | job-ID prior name user state submit/start at queue slots ja-task-ID |
----------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qalter -w p 99154 | tail -10 | [(it_css:traine)@farber it_css]$ qalter -w p 99154 | tail -10 |
Job 99154 has no permission for cluster queue "puleo-qrsh.q" | Job 99154 has no permission for cluster queue "puleo-qrsh.q" |
Job 99154 has no permission for cluster queue "capsl.q+" | Job 99154 has no permission for cluster queue "capsl.q+" |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qstatgrp | [(it_css:traine)@farber it_css]$ qstatgrp |
CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDPS cdsuE | CLUSTER QUEUE CQLOAD USED RES AVAIL TOTAL aoACDPS cdsuE |
it_css-dev.q 0.00 0 0 72 72 0 0 | it_css-dev.q 0.00 0 0 72 72 0 0 |
</code> | </code> |
| |
Use **qalter** to change the attributes of a pending job so that it can run, for instance by reducing the number of slots requested to fit within the workgroup ''it_css'' nodes, or by changing the requested resources so that the job targets the [[general:jobsched:standby|standby queue]]. For example, to request 48 slots instead of 144, use | Use **qalter** to change the attributes of a pending job so that it can run, for instance by reducing the number of slots requested to fit within the workgroup ''it_css'' nodes, or by changing the requested resources so that the job targets the [[:abstract:farber:runjobs:queues#farber-standby-queues|standby queue]]. For example, to request 48 slots instead of 144, use |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qalter -pe openmpi 48 99154 | [(it_css:traine)@farber it_css]$ qalter -pe openmpi 48 99154 |
modified parallel environment of job 99154 | modified parallel environment of job 99154 |
modified slot range of job 99154 | modified slot range of job 99154 |
[(it_css:traine)@mills it_css]$ qstat -u traine | [(it_css:traine)@farber it_css]$ qstat -u traine |
job-ID prior name user state submit/start at queue slots ja-task-ID | job-ID prior name user state submit/start at queue slots ja-task-ID |
----------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
Another way to get this job running is to change its resource requests so that it runs in the standby queue. All resources must be specified again, since ''qalter'' completely replaces any parameters previously specified for the job by that option. In this example, the job is altered to run in the standby queue by using | Another way to get this job running is to change its resource requests so that it runs in the standby queue. All resources must be specified again, since ''qalter'' completely replaces any parameters previously specified for the job by that option. In this example, the job is altered to run in the standby queue by using |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qalter -l idle=0,standby=1 99154 | [(it_css:traine)@farber it_css]$ qalter -l idle=0,standby=1 99154 |
modified hard resource list of job 99154 | modified hard resource list of job 99154 |
[(it_css:traine)@mills it_css]$ qstat -u traine | [(it_css:traine)@farber it_css]$ qstat -u traine |
job-ID prior name user state submit/start at queue slots ja-task-ID | job-ID prior name user state submit/start at queue slots ja-task-ID |
----------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| |
<code bash> | <code bash> |
[(it_css:traine)@mills it_css]$ qstat -u traine | [(it_css:traine)@farber it_css]$ qstat -u traine |
job-ID prior name user state submit/start at queue slots ja-task-ID | job-ID prior name user state submit/start at queue slots ja-task-ID |
----------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| |
<code bash> | <code bash> |
[traine@mills ~]$ qstat -j 686924 | grep error | [traine@farber ~]$ qstat -j 686924 | grep error |
error reason 1: 08/12/2014 22:08:27 [1208:60529]: error: can't chdir to /archive/it_css/traine/ex-openmpi: No such file or directory | error reason 1: 08/12/2014 22:08:27 [1208:60529]: error: can't chdir to /archive/it_css/traine/ex-openmpi: No such file or directory |
</code> | </code> |
| |
<code bash> | <code bash> |
[traine@mills ~]$ qmod -cj 686924 | [traine@farber ~]$ qmod -cj 686924 |
</code> | </code> |
| |
| ''qhostgrp'' | ''qhostgrp'' displays the current status of your owner-group's nodes | | | ''qhostgrp'' | ''qhostgrp'' displays the current status of your owner-group's nodes | |
| ''qhostgrp -g'' <<//investing_entity//>> | ''qhostgrp -g it_css'' displays the current status of the nodes owned by the\\ //it_css// investing-entity. | | | ''qhostgrp -g'' <<//investing_entity//>> | ''qhostgrp -g it_css'' displays the current status of the nodes owned by the\\ //it_css// investing-entity. | |
| ''qhostgrp -j -g'' <<//investing_entity//>> | ''qhostgrp -j -g it_css'' shows all jobs running (including [[general/jobsched/standby|standby]] and spillover) in the owner-group nodes for the //it_css// investing-entity. | | | ''qhostgrp -j -g'' <<//investing_entity//>> | ''qhostgrp -j -g it_css'' shows all jobs running (including [[:abstract:farber:runjobs:queues#farber-standby-queues|standby]] and spillover) in the owner-group nodes for the //it_css// investing-entity. | |
| |
==== Checking overall usage of resource quotas ==== | ==== Checking overall usage of resource quotas ==== |
| |
Resource quotas are used to help control the standby and spillover queues. Each user has a quota based on the limits set by the [[general/jobsched/standby|standby]] queue specifications for each cluster, and each workgroup has a per_workgroup quota based on the number of slots purchased by the research group. | Resource quotas are used to help control the standby and spillover queues. Each user has a quota based on the limits set by the [[:abstract:farber:runjobs:queues#farber-standby-queues|standby]] queue specifications for each cluster, and each workgroup has a per_workgroup quota based on the number of slots purchased by the research group. |
| |
^ Command ^ Illustrative example ^ | ^ Command ^ Illustrative example ^ |
| ''qquota -u'' <<//username//>> ''| grep standby'' | ''qquota -g traine | grep standby'' displays the current usage of slots by user\\ //traine// in the standby resources. | | | ''qquota -u'' <<//username//>> ''| grep standby'' | ''qquota -u traine | grep standby'' displays the current usage of slots by user\\ //traine// in the standby resources. | |
| ''qquota -u \* | grep'' <<//investing_entity//>> | ''qquota -u \* | grep it_css'' displays the slots currently in use by all\\ members of the //it_css// investing-entity, counted against the per_workgroup quota. | | | ''qquota -u \* | grep'' <<//investing_entity//>> | ''qquota -u \* | grep it_css'' displays the slots currently in use by all\\ members of the //it_css// investing-entity, counted against the per_workgroup quota. | |
| |
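
The job-centric ''qstat -u'' listings above can be condensed into a count of jobs per state (for example ''qw'' for waiting or ''r'' for running). The following is a minimal sketch, not part of Grid Engine: ''summarize_states'' is a hypothetical helper name, and it assumes the default listing layout shown above, with two header lines followed by one line per job and the state in the fifth whitespace-separated column.

```bash
#!/usr/bin/env bash
# Sketch only: summarize_states is a hypothetical helper, not a Grid Engine command.
# Assumption: input is the default `qstat` job listing, i.e. two header lines
# followed by one line per job, with the job state in the fifth column.
summarize_states() {
    # Skip the two header lines, count each state, print "state count" sorted.
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Typical use on the cluster (requires Grid Engine to be available):
#   qstat -u traine | summarize_states
```

Because the helper only reads standard input, it can be tried on saved ''qstat'' output before being used on a live cluster. Job names containing whitespace would shift the columns and are not handled by this sketch.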