Scheduling Jobs on Farber

To schedule any job (interactive or batch) on a cluster, you must first set your workgroup, which defines your cluster group and its investing-entity compute nodes.

As discussed, an interactive job allows a user to enter a sequence of commands manually. The following qualify as being interactive jobs:

  • A program with a GUI: e.g. creating graphs in Matlab
  • A program that requires manual input: e.g. a menu-driven post-processing program
  • Any task that is more easily performed manually

To illustrate the final bullet point, suppose a user has a long-running batch job and must later extract results from its output using a single command that will execute for a short time (say five minutes). While the user could go to the effort of creating a batch job, it may be easier to just run the command interactively and visually note its output.

All interactive jobs should be scheduled to run on the compute nodes, not the login/head node.

An interactive session (interactive job) can often be made non-interactive (batch job) by putting the input in a file, using the redirection symbols < and >, and making the entire command a line in a job script file:

program_name < input_command_file > output_command_file

The resulting non-interactive job can then be scheduled as a batch job.
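As a minimal sketch of this redirection pattern, the lines below use a small shell function as a stand-in for an interactive program. The function name menu_program and the commands placed in the input file are hypothetical; only the form of the redirected command line comes from the text above.

```shell
#!/bin/bash
# Hypothetical stand-in for a menu-driven program: it reads one command per
# line from standard input and reports what it would have done.
menu_program() {
    while read -r choice; do
        echo "selected: $choice"
    done
}

# Put the keystrokes a user would have typed into an input file.
printf 'load results.dat\nsummarize\nquit\n' > input_command_file

# Same form as the job-script line: stdin from a file, stdout to a file.
menu_program < input_command_file > output_command_file

cat output_command_file
```

Because nothing is read from the keyboard, the same command line can sit unchanged in a job script.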

Starting an interactive session

Remember you must specify your workgroup to define your cluster group or investing-entity compute nodes before submitting any job, and this includes starting an interactive session. Now use the Grid Engine command qlogin on the login (head) node. Grid Engine will look for a node with a free scheduling slot (processor core) and a sufficiently light load, and then assign your session to it. If no such node becomes available, your qlogin request will eventually time out. The qlogin command results in a job in the workgroup interactive serial queue, <investing_entity>-qrsh.q.

Type

    workgroup -g «investing_entity»

Type

    qlogin

to reserve one scheduling slot and start an interactive shell on one of your workgroup investing-entity compute nodes.

Type

    qlogin -pe threads 12

to reserve 12 scheduling slots and start an interactive shell on one of your workgroup investing-entity compute nodes.

Type

    exit

to terminate the interactive shell and release the scheduling slot(s).

Acceptable nodes for interactive sessions

Use the login (head) node for interactive program development including Fortran, C, and C++ program compilation. Use Grid Engine (qlogin) to start interactive shells on your workgroup investing-entity compute nodes.

In Grid Engine, interactive jobs are submitted to the job scheduler using the qlogin command:

[(it_css:traine)@farber it_css]$ qlogin
Your job 78731 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 78731 has been successfully scheduled.
Establishing /opt/shared/GridEngine/local/qlogin_ssh session to host n013 ...
[traine@n013 it_css]$

Dissecting this text, we see that:

  1. the job was assigned a numerical job identifier or job id of 78731
  2. the job is named "QLOGIN"
  3. the job is executing on compute node n013
  4. the final line is a shell prompt, running on n013 and waiting for commands to be typed

What is not apparent from the text:

  • the shell prompt on compute node n013 has as its working directory the directory in which the qlogin command was typed (it_css)
  • if resources had not been immediately available to this job, the text would have "hung" at "waiting for interactive job to be scheduled …" and later resumed with the message about its being successfully scheduled

To execute an interactive job only if resources are immediately available, use the -now command-line flag:

[(it_css:traine)@farber it_css]$ qlogin -now y
Your job 78735 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4
 
Your "qlogin" request could not be scheduled, try again later.
[(it_css:traine)@farber it_css]$ 

By default, an interactive job submitted using qlogin is given the name "QLOGIN". This can get confusing if a user has many interactive jobs submitted at one time. Taking a moment to name each interactive job according to its purpose may save the user a lot of effort later:

[(it_css:traine)@farber it_css]$ qlogin -N 'Matlab graphs'
Your job 78737 ("Matlab graphs") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 78737 has been successfully scheduled.
Establishing /opt/shared/GridEngine/local/qlogin_ssh session to host n013 ...
[traine@n013 it_css]$

The name provided with the -N command-line option will appear in job status listings (see the next section).

Prerequisite to the submission of batch jobs to the job scheduler is the writing of a job script. Grid Engine job scripts follow the same form as shell scripts, with a few exceptions:

  • the "interpreter line" can be omitted
  • the file need not have its executable bit set
  • comment lines starting with "#$" act like command-line options when the job is submitted

The first two points actually keep a job script "safer" because the script cannot be mistakenly executed on the head node. When a batch job is submitted Grid Engine makes a copy of the script and removes the executable bit for exactly this reason.

The simplest possible job script looks something like this:

job_script_00.qs
echo "Hello, world."

As Grid Engine is configured on UD clusters, this job script would be executed within a BASH shell. To use a different shell, the -S command-line option can be embedded in the job script:

job_script_01.qs
#$ -S /bin/tcsh
 
echo "Hello, world."

Grid Engine provides the qsub command for scheduling batch jobs:

Command                                     Action
qsub «command_line_options» «job_script»    Submit a job using the script commands in the file «job_script»

For example,

[(it_css:traine)@farber it_css]$ qsub job_script_01.qs 
Your job 78742 ("job_script_01.qs") has been submitted

Notice that the job name defaults to the name of the job script; as discussed in the previous section, a job name can also be provided explicitly:

job_script_02.qs
#$ -N testing002
 
echo "Hello, world."

which when submitted would yield

[(it_css:traine)@farber it_css]$ qsub job_script_02.qs 
Your job 78745 ("testing002") has been submitted

It has already been demonstrated that command-line options to the qsub command can be embedded in a job script. Likewise, the options can be specified on the command line. For example:

[(it_css:traine)@farber it_css]$ qsub -N 'testingtoo' job_script_02.qs
Your job 78748 ("testingtoo") has been submitted

The -N option was provided both in the job script and on the command line: Grid Engine will honor options from the command line in preference to those embedded in the script. Thus, in this case the "testingtoo" provided on the command line overrode the "testing002" from the job script.

The qsub command has many options available, all of which are documented in its man page. A few of the often-used options will be discussed here.

Default Options

There are several default options that are automatically added to every qsub by Grid Engine:

Option   Discussion
-j y     Regular (stdout) and error (stderr) output emitted by the job script should go to a single file
-cwd     When the job executes, its working directory should be the working directory at the time of job submission
-w w     Grid Engine checks submitted jobs to ensure that at least one queue will accept them; this option indicates that jobs with no valid queue produce a warning and remain queued

There are default resource requirements supplied, as well, but they are beyond the scope of this section. Providing an alternate value for any of these arguments – in the job script or on the qsub command line – overrides the default value.

Email Notifications

Since batch jobs can run unattended, the user may want to be notified of status changes for a job: when the job begins executing; when the job finishes; or if the job was killed. Grid Engine will deliver such notifications (as emails) to a job's owner if the owner requests them using the -m option. This option has a single argument, consisting of letters indicating the state changes for which notifications should be delivered:

Letter   State Change
b        The job has started executing
e        The job has completed execution without error
a        The job aborted or was rescheduled
s        The job was suspended

To receive notification when the job is finished – successfully or in error – the user would specify -m ea either on the command line or in the job script. The user should supply the target email address for these notifications using the -M option:

#$ -N 'Sample job'
#$ -m ea
#$ -M traine@gmail.com
 
echo "Hello, world."

Scheduling in the Future

Some jobs may only be eligible for execution after a certain date and time have passed. While a user could wait until that time has arrived to submit the job, Grid Engine also allows a job to be submitted with a requested start time. Grid Engine will do its best to meet that date and time. For example, on September 14 a user arranges with an external agent to copy a weather data file to the cluster around 6:00 p.m. on September 20. The user wishes to process the data (allowing 30 minutes for the file transfer to complete) as soon as possible. On September 14, the user could submit a batch job to be executed in the future:

[(it_css:traine)@farber it_css]$ qsub -a 201209201830 process_weather.qs
Your job 78758 ("process_weather.qs") has been submitted

where the argument to the -a option is in the form YYYYMMDDHHmm (year, month, day, hour, minute).
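The -a argument can be generated with the date command rather than typed by hand. This is a sketch that assumes GNU date (for its -d option); the variable name start_time is only illustrative.

```shell
#!/bin/bash
# Format a wall-clock time as the 12-digit string that qsub -a expects.
# Assumes GNU date; -d parses a human-readable timestamp.
start_time=$(date -d '2012-09-20 18:30' +%Y%m%d%H%M)

# Show the command that would be submitted.
echo "qsub -a ${start_time} process_weather.qs"
```

Building the string this way avoids miscounting digits when the date is typed manually.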

Equally as important as executing the job is capturing any output produced by the job. As mentioned above, the -j y option sends all output (stdout and stderr) to a single file. By default, that output file is named according to the formula

[job name].o[job id]

For the weather-processing example above, the output would be found in

[(it_css:traine)@farber it_css]$ qsub -a 201209201830 process_weather.qs
Your job 78758 ("process_weather.qs") has been submitted
[(it_css:traine)@farber it_css]$ 
#
#   ... some time goes by ...
#
[(it_css:traine)@farber it_css]$ ls *.o*
process_weather.qs.o78758

In the job script itself it is often counterproductive to redirect a constituent command's output to a file. Allowing all output to stdout/stderr to be directed to the file provided by Grid Engine automatically provides a degree of "versioning" of all runs of the job by way of the .o[job id] suffix on the output file's name.

The name of the output file can be overridden using the -o command-line option to qsub. The argument to this option is the name of the file, possibly containing special characters that will be replaced by the job id, job name, etc. See the qsub man page for a complete description.

If the user overrides the default joining of regular and error output to a single file (using -j n), the error output is directed to a file named as described above but with a .e[job id] suffix. Likewise, an explicit filename can be provided using the -e option.

A user may mistakenly omit the script filename from the qsub command. Surprisingly, qsub does not complain in such a situation; instead, it pauses and allows the user to type a script:

[(it_css:traine)@farber it_css]$ qsub
#
# Oops, I forgot to provide a job script to qsub!
#
 
echo "Oops, I did it again."
 
^D
Your job 78774 ("STDIN") has been submitted
[(it_css:traine)@farber it_css]$ 
#
#   ... some time goes by ...
#
[(it_css:traine)@farber it_css]$ cat STDIN.o78774 
Oops, I did it again.

The "^D" represents holding down the "control" key and pressing the "D" key; this signals "end of file" and lets qsub know that the user is done entering lines of text. By default, a batch job submitted in this fashion will be named "STDIN".

For example,

 qsub myproject.qs

or to submit a standby job that waits for idle nodes (up to 240 slots for 8 hours),

 qsub -l standby=1 myproject.qs

or to submit a standby job that waits for idle 48-core nodes (if you are using a cluster with 48-core nodes like Farber),

 qsub -l standby=1 -q standby.q@@48core myproject.qs

or to submit a standby job that waits for idle 24-core nodes (it would not be assigned to any 48-core nodes, which is important for consistency of core assignment),

 qsub -l standby=1 -q standby.q@@24core myproject.qs

or to submit to the four hour standby queue (up to 816 slots spanning all nodes)

 qsub -l standby=1,h_rt=4:00:00 myproject.qs

or to submit to the four hour standby queue spanning just the 24-core nodes.

 qsub -l standby=1,h_rt=4:00:00 -q standby-4h.q@@24core myproject.qs

This file myproject.qs will contain bash shell commands and qsub statements that include qsub options and resource specifications. The qsub statements begin with #$.

We strongly recommend that you use a script file that you pattern after the prototypes in /opt/shared/templates and save your job script files within a $WORKDIR (private work) directory.

Reusable job scripts help you maintain a consistent batch environment across runs. The optional .qs filename suffix signifies a queue-submission script file.

See also resource options to specify memory free and/or available, exclusive access, and requesting specific Matlab licenses.

In every batch session, Grid Engine sets environment variables that are useful within job scripts. Here are some common examples. The rest appear in the ENVIRONMENTAL VARIABLES section of the qsub man page.

Environment variable   Contains
HOSTNAME               Name of the execution (compute) node
JOB_ID                 Batch job id assigned by Grid Engine
JOB_NAME               Name you assigned to the batch job (See Command options for qsub)
NSLOTS                 Number of scheduling slots (processor cores) assigned by Grid Engine to this job
SGE_TASK_ID            Task id of an array job sub-task (See Array jobs)
TMPDIR                 Name of directory on the (compute) node scratch filesystem
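A job script can read these variables like any other shell variable. The sketch below prints a short summary; the function name describe_job and the fallback values after ":-" are assumptions added so the lines can also be tried outside Grid Engine.

```shell
#!/bin/bash
# Print a short summary of the job's environment.
# The :- fallbacks only matter when the script runs outside Grid Engine.
describe_job() {
    name="${JOB_NAME:-unnamed}"
    id="${JOB_ID:-0}"
    echo "job=${name} id=${id} slots=${NSLOTS:-1} host=${HOSTNAME:-unknown}"
    # Default output file name, following the [job name].o[job id] formula.
    echo "default output file: ${name}.o${id}"
}

describe_job
```

Placing a call like this at the top of a job script records in the output file which node and how many slots each run actually received.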

When Grid Engine assigns one of your job's tasks to a particular node, it creates a temporary work directory on that node's 1-2 TB local scratch disk. And when the task assigned to that node is finished, Grid Engine removes the directory and its contents. The form of the directory name is

/scratch/[$JOB_ID].[$SGE_TASK_ID].«queue_name»

For example after qlogin type

    echo $TMPDIR

to see the name of the node scratch directory for this interactive job.

/scratch/71842.1.it_css-qrsh.q

See Filesystems and Computing environment for more information about the node scratch filesystem and using environment variables.
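Because Grid Engine removes the scratch directory when the task ends, results written there have to be copied out before the script exits. The following is a sketch of that pattern; the file names and the sort command are stand-ins, and the /tmp fallback is an assumption so the lines can be tried outside a job.

```shell
#!/bin/bash
# Work in a private directory under node scratch, then copy results back.
# Inside a job, Grid Engine sets TMPDIR; /tmp is only a fallback for testing.
scratch=$(mktemp -d "${TMPDIR:-/tmp}/work.XXXXXX")
results_dir="$PWD"

# Stand-in for staging input data onto the scratch disk.
printf '3\n1\n2\n' > "$scratch/input.dat"

# Stand-in for the real computation.
sort -n "$scratch/input.dat" > "$scratch/sorted.dat"

# Copy back anything worth keeping before the scratch directory disappears.
cp "$scratch/sorted.dat" "$results_dir/sorted.dat"
rm -rf "$scratch"
```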

Grid Engine uses these environment variables' values when creating the job's output files:

File name pattern          Description
[$JOB_NAME].o[$JOB_ID]     Default output filename
[$JOB_NAME].e[$JOB_ID]     Error filename (when not joined to output)
[$JOB_NAME].po[$JOB_ID]    Parallel job output filename (Empty for most queues)
[$JOB_NAME].pe[$JOB_ID]    Parallel job error filename (Usually empty)

The most commonly used qsub options fall into two categories: operational and resource-management. The operational options deal with naming the output files, mail notification of the processing steps, sequencing of a series of jobs, and establishing the UNIX environment. The resource-management options deal with the specific system resources you desire or need, such as parallel programming environments, number of processor cores, maximum CPU time, and virtual memory needed.

The table below lists qsub's common operational options.

Option / Argument       Function
-N «job_name»           Names the job «job_name». Default: the job script's full filename.
-m {b|e|a|s|n}          Specifies when email notifications of the job's status should be sent: beginning, end, abort, suspend. Default: never
-M «email_address»      Specifies the email address to use for notifications.
-j {y|n}                Joins (redirects) the STDERR results to STDOUT. Default: y (yes)
-o «output_file»        Directs job output (STDOUT) to «output_file». Default: see Grid Engine environment variables
-e «error_file»         Directs job errors (STDERR) to «error_file». File is only produced when the qsub option -j n is used.
-hold_jid «job_list»    Holds the job until the jobs named in «job_list» are completed. Jobs may be listed as comma-separated numeric job ids or job names.
-t «task_id_range»      Used for array jobs. See Array jobs for details.

Special notes for IT clusters:

-cwd                    Default. Uses current directory as the job's working directory.
-V                      Ignored. Generally, the login node's environment is not appropriate to pass to a compute node. Instead, you must define the environment variables directly in the job script.
-q «queue_name»         Not needed in most cases. Your choice of resource-management options determines the queue.

The resource-management options for qsub have two common forms:

-l «resource»=«value»
-pe «parallel_environment» «Nproc»

For example, putting the lines

#$ -l h_cpu=1:30:00
#$ -pe threads 12

in the job script tells Grid Engine to set a hard limit of 1.5 hours on the CPU time resource for the job, and to assign 12 processors for your job.

Grid Engine tries to satisfy all of the resource-management options you specify in a job script or as qsub command-line options. If there is a queue already defined that accepts jobs having that particular combination of requests, Grid Engine assigns your job to that queue.

You may give a resource request list in the form -l resource=value. A list of available resources with their associated valid value specifiers can be obtained by the command:

qconf -sc

Each named complex or shortcut can be a resource. Multiple resource=value pairs can be given, separated by commas. The valid values are determined by the resource type. For example, a MEMORY value could be 5G (5 gigabytes), and a TIME value could be 1:30:00 (1 hour 30 minutes).

In a cluster as large as Farber, the two most important resources are cores (CPUs) and memory. The number of cores is called slots. It is listed as a "requestable" and "consumable" resource. Parallel jobs, by definition, can use multiple cores. Thus, the slots resource is handled by the parallel-environment option -pe, and you do not need to put it in a resource list.

For memory you will be concerned about how much is free. Memory resources come as both consumable and sensor driven (not consumable). For example:

memory resource Consumable Explanation
m_mem_free Yes Memory consumed per CPU DURING execution

The m_mem_free is consumable, which means you are reserving the memory for future use. Other jobs, using m_mem_free, may be barred from starting on the node. If you are specifying memory resources for a parallel environment job, the requested memory is multiplied by the slot count. By default, m_mem_free is defined as 1GB of memory per core (slot), if not specified.

When using the shared-memory parallel environment (-pe threads), divide the total memory needed by the number of slots. For example, to request 48G of shared memory for an 8-thread job, request 6G per slot, i.e., '-l m_mem_free=6G'.

Please note that a job error will occur and prevent the queue from accepting jobs when:
  1. specifying "N gigabytes" but omitting the unit such as "G", e.g. "-l m_mem_free=3"
  2. putting a space between the number and the unit, e.g. "-l m_mem_free=3 G"

The correct form should be "-l m_mem_free=3G" for this example.
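The per-slot arithmetic for a threads job can be done in the shell before submitting. This is a sketch using the 48G, 8-slot figures from above; the variable names are illustrative only.

```shell
#!/bin/bash
# Divide the job's total memory need by its slot count to get the per-slot
# value for m_mem_free. Shell arithmetic is integer-only, so round up by hand
# if the total does not divide evenly.
total_gb=48
slots=8
per_slot_gb=$(( total_gb / slots ))

# Number and unit are written together, with no space between them.
mem_option="-l m_mem_free=${per_slot_gb}G"
echo "qsub -pe threads ${slots} ${mem_option} myproject.qs"
```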

Example

Consider 30 serial jobs, which each require 20 Gbytes of memory. Use the command

qsub -l m_mem_free=20G -t 1-30 myjob.qs

This will submit 30 tasks to the queue as an array job, with the SGE_TASK_ID variable set for use in the myjob.qs script. The m_mem_free resource tells Grid Engine not to schedule a task on a node unless the specified amount of memory (20 GB per CPU) is available to consume on that node. Since each task is serial and runs on a single CPU, 20 GB is also the total memory available to it.

The /opt/shared/templates/gridengine directory contains basic prototype job scripts for non-interactive parallel jobs. This section describes the -pe parallel environment option that is required for MPI jobs, OpenMP jobs and other jobs that use the SMP (threads) programming model.

Type the command:

qconf -spl

to display a list of parallel environments available on a cluster.

The general form of the parallel environment option is:

-pe «parallel_environment» «Nproc»

where «Nproc» is the number of processor slots (cores) requested. Just use a single number, and not a range. Grid Engine tries to locate as many free slots as it can and assigns them to that batch job. The environment variable $NSLOTS is given that value.

The two most used parallel environments are threads and mpi.

The threads parallel environment

Jobs such as those that use OpenMP directives run in the threads parallel environment, an implementation of the shared-memory programming model. These SMP jobs can only use the cores on a single node.

For example, if your group only owns nodes with 24 cores, then your -pe threads request may only ask for 24 or fewer slots. Use Grid Engine's qconf command to determine the names and characteristics of the queues and compute nodes available to your investing-entity group on a cluster.

Threaded jobs do not necessarily complete faster when more slots are made available. Before running a series of production runs, you should experiment to determine how many slots generally perform best. Using that quantity will leave the remaining slots for others in your group to request. Remember: others can see how many slots you're using!

OpenMP jobs

For OpenMP jobs, add the following bash command to your job script:

export OMP_NUM_THREADS=$NSLOTS

IT provides a job script template called openmp.qs available in /opt/shared/templates/gridengine/openmp to copy and customize for your OpenMP jobs.

The mpi parallel environment

MPI jobs inherently generate considerable network traffic among the processor cores of a cluster's compute nodes. The processors on the compute node may be connected by two types of networks: InfiniBand and Gigabit Ethernet.

IT has developed templates to help with the openmpi parallel environments for Farber, targeting different user needs and architecture. You can copy the templates from /opt/shared/templates/gridengine/openmpi and customize them. These templates are essentially identical with the exception of the presence or absence of certain qsub options and the values assigned to MPI_FLAGS based on using particular environment variables. In all cases, the parallel environment option must be specified:

-pe mpi «NPROC»

where «NPROC» is the number of processor slots (cores) requested. Use a single number, not a range. Grid Engine tries to locate as many free slots as it can and assigns them to that job. The environment variable NSLOTS is given that value.

IT provides several job script templates in /opt/shared/templates/gridengine/openmpi to copy and customize for your Open MPI jobs. See Open MPI on Farber for more details about these job scripts.

Using the resource option -l nvidia_gpu=1 or -l gpu=1 will schedule your job on a host with a GPU co-processor and block any other jobs from using it at the same time.

Using the resource option -l intel_phi=1 or -l phi=1 will schedule your job on a host with a PHI co-processor and block any other jobs from using it at the same time.

This gives you exclusive access only to the co-processor, not to the whole node. See exclusive access for details on blocking any other jobs on the node.

Running Jobs with Parallelism

The interactive and batch jobs discussed thus far have all been serial in nature: they exist as a sequence of instructions executed in order on a single CPU core. Many problems solved on a computer can be solved more quickly by breaking the job into pieces that can be solved concurrently. If one worker moves a pile of bricks from point A to point B in 30 minutes, then employing a second worker to carry bricks should see the job completed in just 15 minutes. Adding a third worker should decrease the time to 10 minutes. Job parallelism likewise coordinates between multiple serial workers to finish a computation more quickly than if it had been done by a single worker. Parallelism can take many forms, the two most prevalent being threading and message passing. Popular implementations of threading and message passing are the OpenMP and MPI standards.

Sometimes a more loosely-coupled form of parallelism can be used by a job. Suppose a user has a collection of 100 files, each containing the full text of a novel. The user would like to run a program for each file that counts the number of gerunds occurring in the text. The counting program is a simple serial program, but the task can be completed more quickly by analyzing many files concurrently. This form of parallelism requires no threading or message passing, and in Grid Engine parlance is called an array job.
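One way such a task might be written is sketched below: each task uses its task id to pick one line from a list of files. The file names, the demo data, and the "words ending in ing" rule are all hypothetical stand-ins (the rule is only a crude proxy for gerunds); SGE_TASK_ID defaults to 1 so the lines can be tried outside Grid Engine.

```shell
#!/bin/bash
# Hypothetical array task: task N processes line N of a file list.
# SGE_TASK_ID is set by Grid Engine; it defaults to 1 here for testing.
task_id="${SGE_TASK_ID:-1}"

# Small demo data set standing in for the collection of novels.
demo_dir=$(mktemp -d)
printf 'Running and jumping are fun.\n' > "$demo_dir/novel1.txt"
printf 'She sat down.\n' > "$demo_dir/novel2.txt"
ls "$demo_dir"/novel*.txt > "$demo_dir/filelist.txt"

# Pick the file that belongs to this task.
novel=$(sed -n "${task_id}p" "$demo_dir/filelist.txt")

# Crude gerund proxy: count whole words ending in "ing".
count=$(grep -o -w -E '[[:alpha:]]+ing' "$novel" | wc -l | tr -d ' ')
echo "$(basename "$novel"): $count"
```

Each task runs the same script and differs only in the value of the task id, so no communication between tasks is needed.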

Grid Engine uses parallel environments to facilitate the scheduling of jobs that use parallelism. Each queue has a list of parallel environments for which it will accept jobs; any job requesting a parallel environment not listed will not run in that queue. Available parallel environments are displayed using the qconf command:

[(it_css:traine)@farber it_css]$ qconf -spl
generic-mpi
mvapich2
openmpi
threads

Programs that use OpenMP or some other form of thread parallelism should use the "threads" parallel environment. This environment logically limits jobs to run on a single node only, which in turn limits the maximum number of workers to be the CPU core count for a node.

The "generic-mpi" parallel environment should be used in general for jobs that make use of MPI parallelism. This parallel environment spans multiple nodes and allocates workers by "filling-up" one node before moving on to another. When a job starts in this parallel environment, an MPI "machines" file is automatically manufactured and placed in the job's temporary directory at ${TMPDIR}/machines. This file should be copied to a job's working directory or passed directly to the mpirun/mpiexec command used to execute the MPI program.
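As a sketch of passing the machines file directly to the launcher, the lines below assemble an mpirun command line. The -machinefile option is assumed to be accepted by the MPI implementation in use (check its mpirun or mpiexec documentation), and ./my_mpi_program is a hypothetical executable name. The command is only printed here, not run.

```shell
#!/bin/bash
# Assemble the launch command for a generic MPI job.
# Inside a job NSLOTS and TMPDIR are set by Grid Engine; the fallbacks are
# only there so the sketch can be tried elsewhere.
machines_file="${TMPDIR:-/tmp}/machines"
launch="mpirun -np ${NSLOTS:-1} -machinefile ${machines_file} ./my_mpi_program"

# Print the command for inspection; a real job script would execute it.
echo "$launch"
```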

Software that uses MPI but is not started using mpirun or mpiexec will often have arguments or environment variables which can be set to indicate on which hosts the job should run or what file to consult for that list. Please consult software manuals and online support resources before contacting UD IT for help determining how to pass this information to the program.

Some MPI implementations are tightly integrated with Grid Engine and do not need a "machines" file. The "mvapich2" and "openmpi" parallel environments shown in the list above are two such examples. MPI programs compiled with these libraries should use the appropriate variant-specific MPI parallel environment.

After choosing the appropriate parallel environment for a job, the -pe option must be supplied to the qsub or qlogin command. This option has two required arguments: the p.e. name and the number of workers requested:

qsub ... -pe openmpi 96 ...

Like any command-line argument to qsub, the parallel environment option can be specified inside the job script using the #$ -pe … line format.

When a parallel job executes, the following environment variables will be set by Grid Engine:

Variable   Description
NSLOTS     The number of slots granted to the job. OpenMP jobs should assign the value of $NSLOTS to the OMP_NUM_THREADS environment variable, for example.
NHOSTS     The number of hosts spanned by the job.

Detailed information pertaining to individual kinds of parallel jobs – like setting the OMP_NUM_THREADS environment variable to $NSLOTS for OpenMP programs – is provided by UD IT in a collection of job template scripts on a per-cluster basis under the /opt/shared/templates directory. For example, on Farber this directory looks like:

[(it_css:traine)@farber ~]$ ls -l /opt/shared/templates
total 4
drwxr-sr-x 7 frey _sgeadm 104 Jul 17 08:11 dev-projects
drwxrwsr-x 3 frey _sgeadm  43 Apr 13 08:38 gaussian
drwxrwsr-x 3 frey _sgeadm  38 Apr 13 08:38 generic-mpi
drwxrwsr-x 3 frey _sgeadm  34 Apr 13 08:38 gromacs
drwxrwsr-x 3 frey _sgeadm  35 Apr 13 08:38 mvapich2
drwxrwsr-x 3 frey _sgeadm  33 Apr 13 08:38 openmp
drwxrwsr-x 3 frey _sgeadm  84 Sep 10 10:11 openmpi
-rw-rw-r-- 1 frey _sgeadm 536 Apr 13 08:38 serial.qs

The directory layout is self-explanatory: script templates specific to OpenMP, Open MPI, and MVAPICH2 are in their own subdirectories; a generic MPI job script can be found in the generic-mpi directory; a template for serial jobs is in serial.qs. The scripts are heavily documented to aid in users' choice of appropriate templates.

An array job runs the same job script many times, once per task. For each task, Grid Engine sets the environment variable SGE_TASK_ID to a sequence number, and that value serves as input to the job script.

The $SGE_TASK_ID value is the key to making array jobs useful. Use it in your bash script, or pass it as a parameter so your program can decide how to complete the assigned task.

For example, the $SGE_TASK_ID sequence values of 2, 4, 6, … , 5000 might be passed as an initial data value to 2500 repetitions of a simulation model. Alternatively, each iteration (task) of a job might use a different data file with filenames of data$SGE_TASK_ID (i.e., data1, data2, data3, …, data2000).

The general form of the qsub option is:

-t start_value-stop_value:step_size

with a default step_size of 1. For these examples, the option would be:

-t 2-5000:2 and -t 1-2000
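The task ids generated by a range can be previewed with seq before submitting. This is a sketch only: seq is used as a stand-in for Grid Engine's own expansion of the range, and the default task id of 3 is an assumption so the last lines can be tried outside a job.

```shell
#!/bin/bash
# Count the task ids that "-t 2-5000:2" would generate.
# seq FIRST STEP LAST mirrors start_value, step_size, stop_value.
task_count=$(seq 2 2 5000 | wc -l | tr -d ' ')
echo "tasks: $task_count"

# Inside one task, SGE_TASK_ID selects that task's data file.
# It defaults to 3 here only so the line can be tried outside Grid Engine.
data_file="data${SGE_TASK_ID:-3}"
echo "this task reads: $data_file"
```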

Additional simple how-to examples for array jobs.

If you have multiple jobs and want one or more of them to run automatically after another job has finished, you can use chaining. When you chain jobs, remember to check the status of the earlier job to determine whether it completed successfully. This prevents the scheduler from being flooded with failed jobs. Here is a simple chaining example with three job scripts: doThing1.qs, doThing2.qs and doThing3.qs.

doThing1.qs
#$ -N doThing1
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:

# Now append all of your shell commands necessary to run your program
# after this line:
 ./dotask1

doThing2.qs
#$ -N doThing2
#$ -hold_jid doThing1
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:

# Now append all of your shell commands necessary to run your program
# after this line:

# Here is where you should add a test to make sure
# that dotask1 successfully completed before running
# ./dotask2
# You might check if a specific file(s) exists that you would
# expect after a successful dotask1 run, something like this
#  if [ -e dotask1.log ] 
#      then ./dotask2
#  fi
# If dotask1.log does not exist it will do nothing.
# If you don't need a test, then you would run the task.
 ./dotask2

doThing3.qs
#$ -N doThing3
#$ -hold_jid doThing2
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:

# Now append all of your shell commands necessary to run your program
# after this line:
# Here is where you should add a test to make sure
# that dotask2 successfully completed before running
# ./dotask3
# You might check if a specific file(s) exists that you would
# expect after a successful dotask2 run, something like this
#  if [ -e dotask2.log ] 
#      then ./dotask3
#  fi
# If dotask2.log does not exist it will do nothing.
# If you don't need a test, then just run the task.
 ./dotask3

Now submit all three job scripts. In this example, we are using account traine in workgroup it_css on farber.

[(it_css:traine)@farber ~]$ qsub doThing1.qs
[(it_css:traine)@farber ~]$ qsub doThing2.qs
[(it_css:traine)@farber ~]$ qsub doThing3.qs

The basic flow is doThing2 will wait until doThing1 finishes, and doThing3 will wait until doThing2 finishes. If you test for success, then doThing2 will check to make sure that doThing1 was successful before running, and doThing3 will check to make sure that doThing2 was successful before running.
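
A log file may exist even when the task that wrote it failed, so one alternative is to have each task leave a marker file only when it exits successfully and to test for that marker in the next job. The following is a minimal sketch of that idea; the helper name run_if_ok and the .ok marker files are assumptions made for illustration, not something Grid Engine provides.

```shell
# Sketch only: "run_if_ok" and the ".ok" marker files are hypothetical names.
#
# Usage: run_if_ok MARKER [MARKER ...] -- COMMAND [ARG ...]
# Runs COMMAND only if every MARKER file exists; otherwise returns non-zero.
run_if_ok() {
    while [ "$#" -gt 0 ] && [ "$1" != "--" ]; do
        if [ ! -e "$1" ]; then
            echo "skipping: marker $1 not found" >&2
            return 1
        fi
        shift
    done
    if [ "$#" -eq 0 ]; then
        echo "usage: run_if_ok MARKER ... -- COMMAND" >&2
        return 2
    fi
    shift    # drop the "--" separator
    "$@"     # run the guarded command
}

# In doThing1.qs, create the marker only when the task exits with status 0:
#     ./dotask1 && touch dotask1.ok
# In doThing2.qs, run the next task only if that marker is present:
#     run_if_ok dotask1.ok -- ./dotask2
```

Because the marker is created after the task returns, a task that stops partway through leaves no marker and the next job in the chain does nothing.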

You might also want to have doThing1 and doThing2 execute at the same time, and only run doThing3 after both have finished. In this case, you will need to change the doThing2 and doThing3 scripts and their tests.

doThing2.qs
#$ -N doThing2
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:

# Now append all of your shell commands necessary to run your program
# after this line:
 ./dotask2

doThing3.qs
#$ -N doThing3
#$ -hold_jid doThing1,doThing2
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:

# Now append all of your shell commands necessary to run your program
# after this line:
# Here is where you should add a test to make sure
# that dotask1 and dotask2 successfully completed before running
# ./dotask3
# You might check if a specific file(s) exists that you would
# expect after a successful dotask1 and dotask2 run, something like this
#  if [ -e dotask1.log -a -e dotask2.log ];
#      then ./dotask3
#  fi
# Unless both files exist, it will do nothing.
# If you don't need a test, then just run the task.
 ./dotask3

Now submit all three jobs again. This time, doThing1 and doThing2 can run at the same time, and doThing3 will run only after both have finished. If you test for success, then doThing3 will check to make sure that doThing1 and doThing2 were successful before running.
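
The -hold_jid option can also be given on the qsub command line instead of inside the job script. The sketch below builds the comma-separated list of job names and prints the resulting command rather than running it. The helper name hold_list is hypothetical, and the example assumes doThing3.qs has no -hold_jid line of its own.

```shell
# Hypothetical helper: join job names with commas for use with -hold_jid.
hold_list() {
    ( IFS=,; echo "$*" )
}

# Dry run: print the submission command instead of executing it.
# Remove the leading "echo" to actually submit on the cluster.
echo qsub -hold_jid "$(hold_list doThing1 doThing2)" doThing3.qs
# prints: qsub -hold_jid doThing1,doThing2 doThing3.qs
```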

Hearkening back to the text-processing example cited above, the analysis of each of the 100 files could be performed by submitting 100 separate jobs to Grid Engine, each modified to work on a different file. Using an array job helps to automate this task: each sub-task of the array job is assigned a unique integer identifier. Each sub-task can find its sub-task identifier in the SGE_TASK_ID environment variable. Consider the following:

[(it_css:traine)@farber it_css]$ qsub -N array -t 1-4 -o 'array.$TASK_ID'
echo "I am sub-task ${JOB_ID}.${SGE_TASK_ID}"
^D
Your job-array 82709.1-4:1 ("array") has been submitted
[(it_css:traine)@farber it_css]$ ...time passes...
[(it_css:traine)@farber it_css]$ ls -1 array.*
array.1
array.2
array.3
array.4
[(it_css:traine)@farber it_css]$ cat array.3
I am sub-task 82709.3

Four sub-tasks are executed, numbered from 1 through 4. The starting index must be greater than zero, and the ending index must be greater than or equal to the starting index. The step size going from one index to the next defaults to one, but can be any positive integer. A step size is appended to the sub-task range as in 2-20:2, which proceeds from 2 up to 20 in steps of 2: 2, 4, 6, 8, 10, and so on up to 20.
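
To see which sub-task identifiers a given range would produce before submitting, the rules above can be sketched in a few lines of shell. The helper name task_ids is an assumption made for illustration; it is not a Grid Engine command and only mimics the range rules described here.

```shell
# Hypothetical helper: print the sub-task identifiers for "-t FIRST-LAST:STEP".
# STEP defaults to 1, as it does for qsub -t.
task_ids() {
    local first=$1 last=$2 step=${3:-1}
    if [ "$first" -lt 1 ] || [ "$last" -lt "$first" ] || [ "$step" -lt 1 ]; then
        echo "invalid range: $first-$last:$step" >&2
        return 1
    fi
    local id=$first ids=""
    while [ "$id" -le "$last" ]; do
        ids="$ids${ids:+ }$id"
        id=$((id + step))
    done
    echo "$ids"
}

task_ids 2 20 2    # prints: 2 4 6 8 10 12 14 16 18 20
task_ids 1 4       # prints: 1 2 3 4
```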

There are essentially two methods for partitioning input data for array jobs. Both methods make use of the sub-task identifier in locating the input for a particular sub-task.

If the 100 novels were in files with names fitting the pattern novel_«sub-task-id».txt then the analysis could be performed with the following qsub command:

[(it_css:traine)@farber novels]$ qsub -N gerunds -o 'gerund_count.$TASK_ID' -t 1-100
#
# Count gerunds in the file:
#
./gerund_count "novel_${SGE_TASK_ID}.txt"
 
^D
Your job-array 82715.1-100:1 ("gerunds") has been submitted

When complete, the job will produce 100 files named gerund_count.«sub-task-id», where the sub-task-id ties each result file to its input file.
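
Before submitting a large array job it can be worth confirming that every expected input file is present, since a sub-task whose input is missing will simply fail. The following sketch assumes the novel_«sub-task-id».txt naming used above; the helper name missing_inputs is hypothetical.

```shell
# Hypothetical helper: list the expected input files that are missing for
# sub-tasks FIRST..LAST, using the novel_<id>.txt pattern from the text.
missing_inputs() {
    local id=$1 last=$2
    while [ "$id" -le "$last" ]; do
        [ -e "novel_${id}.txt" ] || echo "novel_${id}.txt"
        id=$((id + 1))
    done
}

# Example use in the directory holding the novels, before running qsub:
#     missing=$(missing_inputs 1 100)
#     if [ -n "$missing" ]; then
#         echo "missing input files:"; echo "$missing"
#     fi
```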

An alternate method of organizing the chaos associated with large array jobs is to partition the data in directories: the sub-task identifier is not applied to the filenames, but is used to set the working directory for each sub-task:

[(it_css:traine)@farber novels]$ qsub -N gerunds -o gerund_count -t 1-100
#
# Count gerunds in the file:
#
cd ${SGE_TASK_ID}
../gerund_count novel.txt > gerund_count
 
^D
Your job-array 82716.1-100:1 ("gerunds") has been submitted

When complete, each directory will have a file named gerund_count containing the output of the gerund_count command.
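
If the input files start out in a single directory, the per-task directory layout can be created with a short loop. This is a sketch under the assumption that the files follow the novel_«sub-task-id».txt pattern; the helper name make_task_dirs is hypothetical.

```shell
# Hypothetical helper: move novel_<id>.txt into <id>/novel.txt for ids
# FIRST..LAST, skipping any id whose input file is absent.
make_task_dirs() {
    local id=$1 last=$2
    while [ "$id" -le "$last" ]; do
        if [ -e "novel_${id}.txt" ]; then
            mkdir -p "$id"
            mv "novel_${id}.txt" "$id/novel.txt"
        fi
        id=$((id + 1))
    done
}

# Example use in the directory holding the novels:
#     make_task_dirs 1 100
```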

Using an Index File

The partitioning scheme can be as complex as the user desires. If the directories were not named "1" through "100" but instead used the name of the novel contained within, an index file could be created containing the directory names, one per line:

Great_Expectations
Atlas_Shrugged
The_Great_Gatsby
  :

The job submission might then look like:

[(it_css:traine)@farber novels]$ qsub -N gerunds -o gerund_count -t 1-100
#
# Count gerunds in the file:
#
NOVEL_FOR_TASK=$(sed -n "${SGE_TASK_ID}p" index.txt)
cd "$NOVEL_FOR_TASK" || exit 1
../gerund_count novel.txt > gerund_count
 
^D
Your job-array 82718.1-100:1 ("gerunds") has been submitted

The sed command selects a single line of the index.txt file; for sub-task 1 the first line is selected, sub-task 2 the second line, etc.
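
The lookup can be tried outside the scheduler before submitting. The sketch below wraps the same sed selection in a function and adds a line count that could serve as the upper bound of the -t range. The helper names dir_for_task and task_count are assumptions made for illustration.

```shell
# Hypothetical helpers for the index-file scheme. The index file holds one
# directory name per line, as described in the text.

# Print the name on line ID of INDEX_FILE (prints nothing if out of range).
dir_for_task() {
    sed -n "${2}p" "$1"
}

# Print the number of lines in INDEX_FILE, i.e. N for "qsub -t 1-N".
task_count() {
    grep -c '' "$1"
}

# Example use:
#     dir_for_task index.txt 2     # name of the directory for sub-task 2
#     qsub -t 1-"$(task_count index.txt)" ...
```

Deriving the range from the index file keeps the number of sub-tasks in step with the number of entries, rather than relying on a hard-coded 100.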

  • Last modified: 2021-04-27 16:21