====== Scheduling Jobs on Farber ======
In order to schedule any job (interactively or batch) on a cluster, you must set your **[[/abstract/farber/app_dev/compute_env#using-workgroup-and-directories|workgroup]]** to define your cluster group or //investing-entity// compute nodes.
===== Interactive jobs (qlogin) =====
As discussed, an //interactive job// allows a user to enter a sequence of commands manually. The following qualify as being interactive jobs:
* A program with a GUI: e.g. creating graphs in Matlab
* A program that requires manual input: e.g. a menu-driven post-processing program
* Any task that is more easily performed manually
As far as the final bullet point goes, suppose a user has a long-running batch job and must later extract results from its output using a single command that will execute for a short time (say five minutes). While the user could go to the effort of creating a batch job, it may be easier to just run the command interactively and visually note its output.
All interactive jobs should be scheduled to run on the compute nodes, not the login/head node.
An interactive session (job) can often be made non-interactive by putting the input in a file, using the redirection symbols **<** and **>**, and making the entire command a line in a job script file:
//program_name// < //input_command_file// > //output_command_file//
The job can then be scheduled as a non-interactive [[abstract:farber:runjobs:schedule_jobs#batch-jobs-qsub|batch job]].
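For example, if a (hypothetical) solver normally prompts for its parameters, the job script line
./my_solver < solver_input.txt > solver_output.txt
runs it with its prompts answered from ''solver_input.txt'' and its output captured in ''solver_output.txt''.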
=== Starting an interactive session ===
Remember you must specify your **[[/abstract/farber/app_dev/compute_env#using-workgroup-and-directories|workgroup]]** to define your cluster group or //investing-entity// compute nodes before submitting any job, and this includes starting an interactive session. Now use the Grid Engine command **qlogin** on the login (head) node. Grid Engine will look for a node with a free //scheduling slot// (processor core) and a sufficiently light load, and then assign your session to it. If no such node becomes available, your **qlogin** request will eventually time out. The **qlogin** command results in a job in the workgroup interactive serial queue, ''<investing_entity>-qrsh.q''.
Type
workgroup -g //investing-entity//
Type
qlogin
to reserve one scheduling slot and start an interactive shell on one of your workgroup //investing-entity// compute nodes.
Type
qlogin -pe threads 12
to reserve 12 scheduling slots and start an interactive shell on one of your workgroup //investing-entity// compute nodes.
Type
exit
to terminate the interactive shell and release the scheduling slot(s).
=== Acceptable nodes for interactive sessions ===
Use the login (head) node for interactive program development including Fortran, C, and C++ program compilation. Use Grid Engine (**qlogin**) to start interactive shells on your workgroup //investing-entity// compute nodes.
==== Submitting an Interactive Job ====
In Grid Engine, interactive jobs are submitted to the job scheduler using the ''qlogin'' command:
[(it_css:traine)@farber it_css]$ qlogin
Your job 78731 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 78731 has been successfully scheduled.
Establishing /opt/shared/GridEngine/local/qlogin_ssh session to host n013 ...
[traine@n013 it_css]$
Dissecting this text, we see that:
- the job was assigned a numerical //job identifier// or //job id// of 78731
- the job is named "QLOGIN"
- the job is executing on compute node ''n013''
- the final line is a shell prompt, running on ''n013'' and waiting for commands to be typed
What is not apparent from the text:
* the shell prompt on compute node ''n013'' has as its working directory the directory in which the ''qlogin'' command was typed (''it_css'')
* if resources had not been immediately available to this job, the text would have "hung" at "''waiting for interactive job to be scheduled ...''" and later resumed with the message about its being successfully scheduled
To execute an interactive job __only__ if resources are immediately available, use the ''-now'' command-line flag:
[(it_css:traine)@farber it_css]$ qlogin -now y
Your job 78735 ("QLOGIN") has been submitted
waiting for interactive job to be scheduled ...timeout (4 s) expired while waiting on socket fd 4
Your "qlogin" request could not be scheduled, try again later.
[(it_css:traine)@farber it_css]$
==== Naming your Job ====
By default an interactive job submitted using ''qlogin'' is given a name of "QLOGIN." This can get confusing if a user has many interactive jobs submitted at one time. Taking a moment to name each interactive job according to its purpose may save the user a lot of effort later:
[(it_css:traine)@farber it_css]$ qlogin -N 'Matlab graphs'
Your job 78737 ("Matlab graphs") has been submitted
waiting for interactive job to be scheduled ...
Your interactive job 78737 has been successfully scheduled.
Establishing /opt/shared/GridEngine/local/qlogin_ssh session to host n013 ...
[traine@n013 it_css]$
The name provided with the ''-N'' command-line option will appear in job status listings (see the next section).
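For a quick check of your jobs and the names assigned to them, the standard Grid Engine ''qstat'' command can be used; a minimal sketch, using the example account from above:
qstat -u traine
This lists the user's pending and running jobs along with their job ids and names.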
===== Batch Jobs (qsub) =====
Prerequisite to the submission of //batch jobs// to the job scheduler is the writing of a //job script//. Grid Engine job scripts follow the same form as shell scripts, with a few exceptions:
* the "interpreter line" can be omitted
* the file need not have its executable bit set
* comment lines starting with "''#$''" act like command-line options when the job is submitted
The first two points actually keep a job script "safer" because the script cannot be mistakenly executed on the head node. When a batch job is submitted, Grid Engine makes a copy of the script and removes the executable bit for exactly this reason.
The simplest possible job script looks something like this:
echo "Hello, world."
As Grid Engine is configured on UD clusters, this job script would be executed within a BASH shell. To use a different shell, the ''-S'' command-line option can be embedded in the job script:
#$ -S /bin/tcsh
echo "Hello, world."
==== Submitting a Batch Job ====
Grid Engine provides the **qsub** command for scheduling batch jobs:
^ Command ^ Action ^
| ''qsub'' <//command_line_options//> <//job_script//> | Submit a job using the script of commands in the file <//job_script//> |
For example,
[(it_css:traine)@farber it_css]$ qsub job_script_01.qs
Your job 78742 ("job_script_01.qs") has been submitted
Notice that the job name defaults to being the name of the job script; as discussed in the previous section, a job name can also be explicitly provided:
#$ -N testing002
echo "Hello, world."
which when submitted would yield
[(it_css:traine)@farber it_css]$ qsub job_script_02.qs
Your job 78745 ("testing002") has been submitted
==== Specifying Options on the Command Line ====
It has already been demonstrated that command-line options to the ''qsub'' command can be embedded in a job script. Likewise, the options can be specified on the command line. For example:
[(it_css:traine)@farber it_css]$ qsub -N 'testingtoo' job_script_02.qs
Your job 78748 ("testingtoo") has been submitted
The ''-N'' option was provided in the job script __and__ on the command line itself: Grid Engine honors options from the command line in preference to those embedded in the script. Thus, in this case the "''testingtoo''" provided on the command line overrode the "''testing002''" from the job script.
The ''qsub'' command has many options available, all of which are documented in its man page. A few of the often-used options will be discussed here.
=== Default Options ===
There are several default options that are automatically added to every ''qsub'' by Grid Engine:
^Option^Discussion^
|''-j y''|Regular (stdout) and error (stderr) output emitted by the job script should go to a single file|
|''-cwd''|When the job executes, its working directory should be the working directory at the time of job submission|
|''-w w''|Grid Engine checks submitted jobs to ensure that at least one queue will accept them; this option indicates that jobs with no valid queue produce a warning and remain queued|
There are default resource requirements supplied, as well, but they are beyond the scope of this section. Providing an alternate value for any of these arguments -- in the job script or on the ''qsub'' command line -- overrides the default value.
=== Email Notifications ===
Since batch jobs can run unattended, the user may want to be notified of status changes for a job: when the job begins executing; when the job finishes; or if the job was killed. Grid Engine will deliver such notifications (as emails) to a job's owner if the owner requests them using the ''-m'' option. This option has a single argument, consisting of letters indicating the state changes for which notifications should be delivered:
^Letter^State Change^
|b|The job has started executing|
|e|The job has completed execution without error|
|a|The job aborted or was rescheduled|
|s|The job was suspended|
To receive notification when the job is finished -- successfully or in error -- the user would specify ''-m ea'' either on the command line or in the job script. The user should supply the target email address for these notifications using the ''-M'' option:
#$ -N 'Sample job'
#$ -m ea
#$ -M traine@gmail.com
echo "Hello, world."
=== Scheduling in the Future ===
Some jobs may only be eligible for execution after a certain date and time have passed. While a user could wait until that time has arrived to submit the job, Grid Engine also allows a job to be submitted with a requested start time. Grid Engine will do its best to meet that date and time. For example, on September 14 a user arranges with an external agent to copy a weather data file to the cluster around 6:00 p.m. on September 20. The user wishes to process the data (allowing 30 minutes for the file transfer to complete) as soon as possible. On September 14, the user could submit a batch job to be executed in the future:
[(it_css:traine)@farber it_css]$ qsub -a 201209201830 process_weather.qs
Your job 78758 ("process_weather.qs") has been submitted
where the argument to the ''-a'' option is in the form ''YYYYMMDDHHmm'' (year, month, day, hour, minute).
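Rather than composing that timestamp by hand, it can be generated with the ''date'' command (a sketch, assuming GNU ''date'' as found on the cluster's Linux nodes):
# build the YYYYMMDDHHmm argument for -a from a human-readable date
START=$(date -d '2012-09-20 18:30' +%Y%m%d%H%M)
qsub -a $START process_weather.qs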
==== Job Output ====
Equally as important as executing the job is capturing any output produced by the job. As mentioned above, the ''-j y'' option sends all output (stdout and stderr) to a single file. By default, that output file is named according to the formula
[job name].o[job id]
For the weather-processing example above, the output would be found in
[(it_css:traine)@farber it_css]$ qsub -a 201209201830 process_weather.qs
Your job 78758 ("process_weather.qs") has been submitted
[(it_css:traine)@farber it_css]$
#
# ... some time goes by ...
#
[(it_css:traine)@farber it_css]$ ls *.o*
process_weather.qs.o78758
In the job script itself it is often counterproductive to redirect a constituent command's output to a file. Allowing all output to stdout/stderr to be directed to the file provided by Grid Engine automatically provides a degree of "versioning" of all runs of the job by way of the ''.o[job id]'' suffix on the output file's name.
The name of the output file can be overridden using the ''-o'' command-line option to ''qsub''. The argument to this option is the name of the file, possibly containing special characters that will be replaced by the job id, job name, etc. See the ''qsub'' man page for a complete description.
If the user overrides the default joining of regular and error output to a single file (using ''-j n''), the error output is directed to a file named as described above but with a ''.e[job id]'' suffix. Likewise, an explicit filename can be provided using the ''-e'' option.
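For example, a sketch that splits the two output streams and names both files explicitly (the filenames are hypothetical):
qsub -j n -o myjob.out -e myjob.err job_script_01.qs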
==== Forgetting the Filename ====
A user may mistakenly omit the script filename from the ''qsub'' command. Surprisingly, ''qsub'' does not complain in such a situation; instead, it pauses and allows the user to type a script:
[(it_css:traine)@farber it_css]$ qsub
#
# Oops, I forgot to provide a job script to qsub!
#
echo "Oops, I did it again."
^D
Your job 78774 ("STDIN") has been submitted
[(it_css:traine)@farber it_css]$
#
# ... some time goes by ...
#
[(it_css:traine)@farber it_css]$ cat STDIN.o78774
Oops, I did it again.
The "''^D''" represents holding down the "control" key and pressing the "D" key; this signals "end of file" and lets ''qsub'' know that the user is done entering lines of text. By default, a batch job submitted in this fashion will be named "''STDIN''".
===== More details about using qsub =====
Batch jobs are submitted to the scheduler with **qsub**. For example,
qsub myproject.qs
or to submit a standby job that waits for idle nodes (up to 240 slots for 8 hours),
qsub -l standby=1 myproject.qs
or to submit a standby job that waits for idle 48-core nodes (if you are using a cluster with 48-core nodes like farber)
qsub -l standby=1 -q standby.q@@48core myproject.qs
or to submit a standby job that waits for idle 24-core nodes (the job would not be assigned to any 48-core nodes; important for consistency of core assignment)
qsub -l standby=1 -q standby.q@@24core myproject.qs
or to submit to the four hour standby queue (up to 816 slots spanning all nodes)
qsub -l standby=1,h_rt=4:00:00 myproject.qs
or to submit to the four hour standby queue spanning just the 24-core nodes.
qsub -l standby=1,h_rt=4:00:00 -q standby-4h.q@@24core myproject.qs
This file ''myproject.qs'' will contain bash shell commands and **qsub** statements that include **qsub** options and resource specifications. The **qsub** statements begin with #$.
We strongly recommend that you use a script file that you pattern after the prototypes in **/opt/shared/templates** and save your job script files within a **$WORKDIR** (private work) directory.
Reusable job scripts help you maintain a consistent batch environment across runs. The optional **.qs** filename suffix signifies a **q**ueue-**s**ubmission script file.
See also [[:abstract:farber:runjobs:schedule_jobs#resource-management-options-on-farber|resource options]] to specify memory free and/or available, [[abstract:farber:runjobs:queues#farber-exclusive-access|exclusive]] access, and requesting specific [[:software:matlab:matlab#license-information|Matlab licenses]].
==== Grid Engine environment variables ====
In every batch session, Grid Engine sets environment variables that are useful within job scripts. Here are some common examples. The rest appear in the ENVIRONMENTAL VARIABLES section of the **qsub** man page.
^ Environment variable ^ Contains ^
| **HOSTNAME** | Name of the execution (compute) node |
| **JOB_ID** | Batch job id assigned by Grid Engine |
| **JOB_NAME** | Name you assigned to the batch job (See [[#command-options-for-qsub|Command options for qsub]]) |
| **NSLOTS** | Number of //scheduling slots// (processor cores) assigned by Grid Engine to this job |
| **SGE_TASK_ID** | Task id of an array job sub-task (See [[#array-jobs|Array jobs]]) |
| **TMPDIR** | Name of directory on the (compute) node scratch filesystem |
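A minimal job script that simply reports these variables can be useful for verifying the batch environment (a sketch; any of the variables above could be echoed):
# report the Grid Engine environment for this job
echo "Job $JOB_ID ($JOB_NAME) running on $HOSTNAME"
echo "Slots granted: $NSLOTS"
echo "Node scratch directory: $TMPDIR"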
When Grid Engine assigns one of your job's tasks to a particular node, it creates a temporary work directory on that node's 1-2 TB local scratch disk. When the task assigned to that node is finished, Grid Engine removes the directory and its contents. The form of the directory name is
**/scratch/[$JOB_ID].[$SGE_TASK_ID].<//queue_name//>**
For example after ''qlogin'' type
echo $TMPDIR
to see the name of the node scratch directory for this interactive job.
/scratch/71842.1.it_css-qrsh.q
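A common pattern is to stage data into the node scratch directory, compute there, and copy results back before the job ends (a sketch with hypothetical program and file names):
# remember the submission directory (the job starts there, per the default -cwd)
SUBMITDIR=$PWD
# stage input onto fast node-local scratch
cp input.dat "$TMPDIR"
cd "$TMPDIR"
"$SUBMITDIR/my_program" input.dat > output.dat
# copy results back before Grid Engine removes the scratch directory
cp output.dat "$SUBMITDIR"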
See [[:clusters:farber:filesystems|Filesystems]] and [[:abstract:farber:app_dev:compute_env|Computing environment]] for more information about the node scratch filesystem and using environment variables.
Grid Engine uses these environment variables' values when creating the job's output files:
^ File name pattern ^ Description ^
| [$JOB_NAME].o[$JOB_ID] | Default **output** filename |
| [$JOB_NAME].e[$JOB_ID] | **error** filename (when not joined to output) |
| [$JOB_NAME].po[$JOB_ID] | Parallel job **output** filename (empty for most queues) |
| [$JOB_NAME].pe[$JOB_ID] | Parallel job **error** filename (usually empty) |
==== More options for qsub ====
The most commonly used **qsub** options fall into two categories: //operational// and //resource-management//. The operational options deal with naming the output files, mail notification of the processing steps, sequencing of a series of jobs, and establishing the UNIX environment. The resource-management options deal with the specific system resources you desire or need, such as parallel programming environments, number of processor cores, maximum CPU time, and virtual memory needed.
The table below lists **qsub**'s common operational options.
^ Option / Argument ^ Function ^
| ''-N'' <//job_name//> | Names the job <//job_name//>. Default: the job script's full filename. |
| ''-m'' {b%%|%%e%%|%%a%%|%%s%%|%%n} | Specifies when e-mail notifications of the job's status should be sent: **b**eginning, **e**nd, **a**bort, **s**uspend. Default: **n**ever |
| ''-M'' <//email_address//> | Specifies the email address to use for notifications. |
| ''-j'' {y%%|%%n} | Joins (redirects) the //STDERR// results to //STDOUT//. Default: **y** (//yes//) |
| ''-o'' <//output_file//> | Directs job output //STDOUT// to <//output_file//>. Default: see [[#grid-engine-environment-variables|Grid Engine environment variables]] |
| ''-e'' <//error_file//> | Directs job errors (//STDERR//) to <//error_file//>. The file is only produced when the **qsub** option ''-j n'' is used. |
| ''-hold_jid'' <//job_list//> | Holds the job until the jobs named in <//job_list//> are completed. Jobs may be listed as comma-separated numeric job ids or job names. |
| ''-t'' <//task_id_range//> | Used for //array// jobs. See [[#array-jobs|Array jobs]] for details. |
^ Special notes for IT clusters: ^^
| ''-cwd'' | Default. Uses the current directory as the job's working directory. |
| ''-V'' | Ignored. Generally, the login node's environment is not appropriate to pass to a compute node. Instead, you must define the environment variables directly in the job script. |
| ''-q'' <//queue_name//> | Not needed in most cases. Your choice of resource-management options determines the queue. |
^ The resource-management options for ''qsub'' have two common forms: ^^
| ''-l'' <//resource//>''=''<//value//> ||
| ''-pe'' <//parallel_environment//> <//Nproc//> ||
For example, putting the lines
#$ -l h_cpu=1:30:00
#$ -pe threads 12
in the job script tells Grid Engine to set a hard limit of 1.5 hours on the CPU time resource for the job, and to assign 12 processors for your job.
Grid Engine tries to satisfy all of the resource-management options you specify in a job script or as qsub command-line options. If there is a queue already defined that accepts jobs having that particular combination of requests, Grid Engine assigns your job to that queue.
===== Resource-management options on Farber =====
You may give a resource request list in the form ''-l resource=value''. A list of available resources with their associated valid value specifiers can be obtained by the command:
qconf -sc
Each named complex or shortcut can be a ''resource''. There can be multiple, comma-separated ''resource=value'' pairs. The valid values are determined by the resource's type: for example, a MEMORY type could be 5G (5 gigabytes), and a TIME type could be 1:30:00 (1 hour 30 minutes).
In a cluster as large as Farber, the two most important resources are cores (CPUs) and memory. The number of cores is called ''slots''. It is listed as a "requestable" and "consumable" resource. Parallel jobs, by definition, can use multiple cores. Thus, the ''slots'' resource is handled by the [[:abstract:farber:runjobs:schedule_jobs#parallel-environments|parallel-environments]] option ''-pe'', and you do not need to put it in a resource list.
==== Memory ====
For memory, your main concern is how much is free. Memory resources come as both consumable and sensor-driven (not consumable). For example:
^ memory resource ^ Consumable ^ Explanation ^
| m_mem_free | Yes | Memory consumed per CPU (slot) during execution |
''m_mem_free'' is consumable, which means you are reserving the memory for future use. Other jobs that request ''m_mem_free'' may be barred from starting on the node. If you are specifying memory resources for a parallel environment job, the requested memory is multiplied by the slot count. If not specified, ''m_mem_free'' defaults to 1GB of memory per core (slot).
When using the shared-memory parallel environment ''-pe threads'', divide the total memory needed by the number of slots. For example, to request 48G of shared memory for an 8-thread job, request 6G per slot, i.e. ''-l m_mem_free=6G''.
Please note a job error will occur and prevent the queue from accepting jobs when:
- specifying "N gigabytes" but omitting the unit such as "G", e.g. "-l m_mem_free=3"
- putting a space between the number and the unit, e.g. "-l m_mem_free=3 G"
The correct form should be "-l m_mem_free=3G" for this example.
=== Example ===
Consider 30 serial jobs, which each require 20 Gbytes of memory. Use the command
qsub -l m_mem_free=20G -t 1-30 myjob.qs
This will submit 30 jobs to the queue, with the ''SGE_TASK_ID'' variable set for use in the ''myjob.qs'' script (an [[:abstract:farber:runjobs:schedule_jobs#array-jobs|array job]]).
The ''m_mem_free'' resource tells Grid Engine not to schedule a job on a node unless the specified amount of memory (here, 20GB per CPU) is available to consume on that node. Since each of these is a serial job that runs on a single CPU, 20GB is also the total memory available to the job.
==== Parallel environments ====
The ''/opt/shared/templates/gridengine'' directory contains basic prototype job scripts for non-interactive parallel jobs. This section describes the **-pe** parallel environment option that is required for MPI jobs, OpenMP jobs, and other jobs that use the SMP (threads) programming model.
Type the command:
qconf -spl
to display a list of parallel environments available on a cluster.
The general form of the parallel environment option is:
''-pe'' <//parallel_environment//> <//Nproc//>
where </Nproc//>> is the number of processor //slots// (cores) requested. Just use a single number, and not a range. Grid Engine tries to locate as many free slots as it can and assigns them to that batch job. The environment variable ''$NSLOTS'' is given that value.
The two most used parallel environments are **threads** and **mpi**.
=== The threads parallel environment ===
Jobs such as those having OpenMP directives use the **//threads//** parallel environment, an implementation of the [[:abstract:farber:app_dev:prog_env#programming-models|shared-memory programming model]]. These SMP jobs can only use the cores on a **single** node.
For example, if your group only owns nodes with 24 cores, then your ''-pe threads'' request may only ask for 24 or fewer slots. Use Grid Engine's **qconf** command to determine the names and characteristics of the queues and compute nodes available to your investing-entity group on a cluster.
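For example (standard Grid Engine query commands; the queue name below is hypothetical):
qconf -sql          # list the queue names defined on the cluster
qconf -sq it_css.q  # show the configuration of one queue
qhost               # show the compute nodes with their core and memory counts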
Threaded jobs do not necessarily complete faster when more slots are made available. Before running a series of production runs, you should experiment to determine how many slots generally perform best. Using that quantity will leave the remaining slots for others in your group to request. Remember: others can see how many slots you're using!
== OpenMP jobs ==
For **OpenMP** jobs, add the following bash command to your job script:
export OMP_NUM_THREADS=$NSLOTS
IT provides a job script template called ''openmp.qs'' available in ''/opt/shared/templates/gridengine/openmp'' to copy and customize for your OpenMP jobs.
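A minimal sketch along the lines of that template (the executable name is hypothetical):
#$ -N openmp_example
#$ -pe threads 8
# match the OpenMP thread count to the slots Grid Engine granted
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program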
=== The mpi parallel environment ===
MPI jobs inherently generate considerable network traffic among the processor cores of a cluster's compute nodes. The compute nodes may be interconnected by two types of networks: InfiniBand and Gigabit Ethernet.
IT has developed templates to help with the **openmpi** parallel environments for Farber, targeting different user needs and architectures. You can copy the templates from ''/opt/shared/templates/gridengine/openmpi'' and customize them. These templates are essentially identical except for the presence or absence of certain **qsub** options and the values assigned to **MPI_FLAGS** based on using particular environment variables. In all cases, the parallel environment option must be specified:
''-pe mpi'' <//NPROC//>
where <//NPROC//> is the number of //processor slots// (cores) requested. Use a single number, not a range. Grid Engine tries to locate as many free slots as it can and assigns them to that job. The environment variable **NSLOTS** is given that value.
IT provides several job script templates in ''/opt/shared/templates/gridengine/openmpi'' to copy and customize for your Open MPI jobs. See [[software:openmpi:farber|Open MPI on Farber]] for more details about these job scripts.
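A minimal sketch in the spirit of those templates (the executable name is hypothetical; the real templates handle environment setup and **MPI_FLAGS** more carefully, and the parallel environment name may differ on your cluster -- check ''qconf -spl''):
#$ -N mpi_example
#$ -pe mpi 48
# load an Open MPI environment
vpkg_require openmpi
# with tight Grid Engine integration, mpirun discovers the granted slots itself
mpirun ./my_mpi_program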
==== Co-processor Access: GPU and PHI ====
Using the resource option ''-l nvidia_gpu=1'' or ''-l gpu=1'' will schedule your job on a host with a GPU co-processor and block any other jobs from using it at the same time.
Using the resource option ''-l intel_phi=1'' or ''-l phi=1'' will schedule your job on a host with a PHI co-processor and block any other jobs from using it at the same time.
This doesn't give you exclusive access to the node, only the co-processor on the node. See [[:abstract:farber:runjobs:queues#farber-exclusive-access|exclusive access]] for details on blocking any other jobs on the node.
====== Running Jobs with Parallelism ======
The interactive and batch jobs discussed thus far have all been //serial// in nature: they exist as a sequence of instructions executed in order on a single CPU core. Many problems solved on a computer can be solved more quickly by breaking the job into pieces that can be solved concurrently. If one worker moves a pile of bricks from point A to point B in 30 minutes, then employing a second worker to carry bricks should see the job completed in just 15 minutes. Adding a third worker should decrease the time to 10 minutes. //Job parallelism// likewise coordinates between multiple serial workers to finish a computation more quickly than if it had been done by a single worker. Parallelism can take many forms, the two most prevalent being threading and message passing. Popular implementations of threading and message passing are the [[http://openmp.org/wp/|OpenMP]] and [[http://www.mpi-forum.org/|MPI]] standards.
Sometimes a more loosely-coupled form of parallelism can be used by a job. Suppose a user has a collection of 100 files, each containing the full text of a novel. The user would like to run a program for each file that counts the number of gerunds occurring in the text. The counting program is a simple serial program, but the task can be completed more quickly by analyzing many files concurrently. This form of parallelism requires no threading or message passing, and in Grid Engine parlance is called an //[[#array-jobs|array job]]//.
Grid Engine uses //parallel environments// to facilitate the scheduling of jobs that use parallelism. Each queue has a list of parallel environments for which it will accept jobs; any job requesting a parallel environment not listed will not run in that queue. Available parallel environments are displayed using the ''qconf'' command:
[(it_css:traine)@farber it_css]$ qconf -spl
generic-mpi
mvapich2
openmpi
threads
===== Threads =====
Programs that use OpenMP or some other form of thread parallelism should use the "threads" parallel environment. This environment logically limits jobs to run on a single node only, which in turn limits the maximum number of workers to be the CPU core count for a node.
===== MPI =====
The "generic-mpi" parallel environment should be used in general for jobs that make use of MPI parallelism. This parallel environment spans multiple nodes and allocates workers by "filling-up" one node before moving on to another. When a job starts in this parallel environment, an MPI "machines" file is automatically manufactured and placed in the job's temporary directory at ''${TMPDIR}/machines''. This file should be copied to a job's working directory or passed directly to the ''mpirun''/''mpiexec'' command used to execute the MPI program.
Software that uses MPI but is not started using ''mpirun'' or ''mpiexec'' will often have arguments or environment variables which can be set to indicate on which hosts the job should run or what file to consult for that list. Please consult software manuals and online support resources before contacting UD IT for help determining how to pass this information to the program.
Some MPI implementations are //tightly integrated// with Grid Engine and do not need a "machines" file. The "mvapich2" and "openmpi" parallel environments shown in the list above are two such examples. MPI programs compiled with these libraries should use the appropriate variant-specific MPI parallel environment.
===== Submitting a Parallel Job =====
After choosing the appropriate parallel environment for a job, the ''-pe'' option must be supplied to the ''qsub'' or ''qlogin'' command. This option has two required arguments: the p.e. name and the number of workers requested:
qsub ... -pe openmpi 96 ...
Like any command-line argument to ''qsub'', the parallel environment option can be specified inside the job script using the ''#$ -pe ...'' line format.
When a parallel job executes, the following environment variables will be set by Grid Engine:
^Variable^Description^
|''NSLOTS''|The number of slots granted to the job. OpenMP jobs should assign the value of ''$NSLOTS'' to the ''OMP_NUM_THREADS'' environment variable, for example.|
|''NHOSTS''|The number of hosts spanned by the job.|
==== Job Templates ====
Detailed information pertaining to individual kinds of parallel jobs -- like setting the ''OMP_NUM_THREADS'' environment variable to ''$NSLOTS'' for OpenMP programs -- are provided by UD IT in a collection of job template scripts on a per-cluster basis under the ''/opt/shared/templates'' directory. For example, on farber this directory looks like:
[(it_css:traine)@farber ~]$ ls -l /opt/shared/templates
total 4
drwxr-sr-x 7 frey _sgeadm 104 Jul 17 08:11 dev-projects
drwxrwsr-x 3 frey _sgeadm 43 Apr 13 08:38 gaussian
drwxrwsr-x 3 frey _sgeadm 38 Apr 13 08:38 generic-mpi
drwxrwsr-x 3 frey _sgeadm 34 Apr 13 08:38 gromacs
drwxrwsr-x 3 frey _sgeadm 35 Apr 13 08:38 mvapich2
drwxrwsr-x 3 frey _sgeadm 33 Apr 13 08:38 openmp
drwxrwsr-x 3 frey _sgeadm 84 Sep 10 10:11 openmpi
-rw-rw-r-- 1 frey _sgeadm 536 Apr 13 08:38 serial.qs
The directory layout is self-explanatory: script templates specific to OpenMP, Open MPI, and MVAPICH2 are in their own subdirectories; a generic MPI job script can be found in the ''generic-mpi'' directory; a template for serial jobs is in ''serial.qs''. The scripts are heavily documented to aid in users' choice of appropriate templates.
==== Array jobs ====
An array job runs the same job script many times as a collection of numbered //tasks//. For each task, Grid Engine sets the environment variable **SGE_TASK_ID** to the task's sequence number, and its value provides input to the job submission script.
The ''$SGE_TASK_ID'' value is the key to making array jobs useful. Use it in your bash script, or pass it as a parameter so your program can decide how to complete the assigned task.
For example, the ''$SGE_TASK_ID'' sequence values of 2, 4, 6, ..., 5000 might be passed as an initial data value to 2500 repetitions of a simulation model. Alternatively, each iteration (task) of a job might use a different data file with filenames of ''data$SGE_TASK_ID'' (i.e., data1, data2, data3, ..., data2000).
The general form of the **qsub** option is:
-t //start_value// - //stop_value// : //step_size//
with a default step_size of 1. For these examples, the option would be:
-t 2-5000:2 and -t 1-2000
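A sketch of a job script for the second of those examples (the program and data file names are hypothetical):
#$ -t 1-2000
# each sub-task processes its own numbered data file
./my_analysis "data${SGE_TASK_ID}"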
See additional simple how-to examples for [[http://wiki.gridengine.info/wiki/index.php/Simple-Job-Array-Howto|array jobs]].
==== Chaining jobs ====
If you have multiple jobs that should automatically run one after another, you can use chaining. When you chain jobs, remember to have each dependent job check the status of the previous job to determine if it completed successfully. This prevents the system from flooding the scheduler with failed jobs. Here is a simple chaining example with three job scripts: ''doThing1.qs'', ''doThing2.qs'' and ''doThing3.qs''.
#$ -N doThing1
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:
# Now append all of your shell commands necessary to run your program
# after this line:
./dotask1
#$ -N doThing2
#$ -hold_jid doThing1
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:
# Now append all of your shell commands necessary to run your program
# after this line:
# Here is where you should add a test to make sure
# that dotask1 successfully completed before running
# ./dotask2
# You might check if a specific file(s) exists that you would
# expect after a successful dotask1 run, something like this
# if [ -e dotask1.log ]
# then ./dotask2
# fi
# If dotask1.log does not exist it will do nothing.
# If you don't need a test, then you would run the task.
./dotask2
#$ -N doThing3
#$ -hold_jid doThing2
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:
# Now append all of your shell commands necessary to run your program
# after this line:
# Here is where you should add a test to make sure
# that dotask2 successfully completed before running
# ./dotask3
# You might check if a specific file(s) exists that you would
# expect after a successful dotask2 run, something like this
# if [ -e dotask2.log ]
# then ./dotask3
# fi
# If dotask2.log does not exist it will do nothing.
# If you don't need a test, then just run the task.
./dotask3
Now submit all three job scripts. In this example, we are using account ''traine'' in workgroup ''it_css'' on farber.
[(it_css:traine)@farber ~]$ qsub doThing1.qs
[(it_css:traine)@farber ~]$ qsub doThing2.qs
[(it_css:traine)@farber ~]$ qsub doThing3.qs
The basic flow is that ''doThing2'' waits until ''doThing1'' finishes, and ''doThing3'' waits until ''doThing2'' finishes. If you test for success, then ''doThing2'' will check to make sure that ''doThing1'' was successful before running, and ''doThing3'' will check to make sure that ''doThing2'' was successful before running.
You might also want to have ''doThing1'' and ''doThing2'' execute at the same time, and only run ''doThing3'' after they finish. In this case, you will need to change the ''doThing2'' and ''doThing3'' scripts and their tests.
#$ -N doThing2
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:
# Now append all of your shell commands necessary to run your program
# after this line:
./dotask2
#$ -N doThing3
#$ -hold_jid doThing1,doThing2
#
# If you want an email message to be sent to you when your job ultimately
# finishes, edit the -M line to have your email address and change the
# next two lines to start with #$ instead of just #
# -m eas
# -M my_address@mail.server.com
#
# Setup the environment; add vpkg_require commands after this
# line:
# Now append all of your shell commands necessary to run your program
# after this line:
# Here is where you should add a test to make sure
# that dotask1 and dotask2 successfully completed before running
# ./dotask3
# You might check if a specific file(s) exists that you would
# expect after a successful dotask1 and dotask2 run, something like this
# if [ -e dotask1.log -a -e dotask2.log ];
# then ./dotask3
# fi
# If both files do not exist it will do nothing.
# If you don't need a test, then just run the task.
./dotask3
Now submit all three jobs again. This time, however, ''doThing1'' and ''doThing2'' will run at the same time, and only when both are finished will ''doThing3'' run. ''doThing3'' will check to make sure ''doThing1'' and ''doThing2'' completed successfully before running.
Hearkening back to the text-processing example cited above, the analysis of each of the 100 files could be performed by submitting 100 separate jobs to Grid Engine, each modified to work on a different file. Using an array job helps to automate this task: each //sub-task// of the array job is assigned a unique integer identifier. Each sub-task can find its sub-task identifier in the ''SGE_TASK_ID'' environment variable. Consider the following:
[(it_css:traine)@farber it_css]$ qsub -N array -t 1-4 -o 'array.$TASK_ID'
echo "I am sub-task ${JOB_ID}.${SGE_TASK_ID}"
^D
Your job-array 82709.1-4:1 ("array") has been submitted
[(it_css:traine)@farber it_css]$ ...time passes...
[(it_css:traine)@farber it_css]$ ls -1 array.*
array.1
array.2
array.3
array.4
[(it_css:traine)@farber it_css]$ cat array.3
I am sub-task 82709.3
Four sub-tasks are executed, numbered from 1 through 4. The starting index must be greater than zero, and the ending index must be greater than or equal to the starting index. The //step size// going from one index to the next defaults to one, but can be any positive integer. A step size is appended to the sub-task range as in ''2-20:2'' -- proceed from 2 up to 20 in steps of 2, i.e. 2, 4, 6, ..., 20.
==== Partitioning Job Data ====
There are essentially two methods for partitioning input data for array jobs. Both methods make use of the sub-task identifier in locating the input for a particular sub-task.
If the 100 novels were in files with names fitting the pattern ''novel_''<<''sub-task-id''>>''.txt'' then the analysis could be performed with the following ''qsub'' command:
[(it_css:traine)@farber novels]$ qsub -N gerunds -o 'gerund_count.$TASK_ID' -t 1-100
#
# Count gerunds in the file:
#
./gerund_count "novel_${SGE_TASK_ID}.txt"
^D
Your job-array 82715.1-100:1 ("gerunds") has been submitted
When complete, the job will produce 100 files named ''gerund_count.''<<''sub-task-id''>> where the ''sub-task-id'' correlates each result with its input file.
An alternate method of organizing the chaos associated with large array jobs is to partition the data in directories: the sub-task identifier is not applied to the filenames, but is used to set the working directory for each sub-task:
[(it_css:traine)@farber novels]$ qsub -N gerunds -o gerund_count -t 1-100
#
# Count gerunds in the file:
#
cd ${SGE_TASK_ID}
../gerund_count novel.txt > gerund_count
^D
Your job-array 82716.1-100:1 ("gerunds") has been submitted
When complete, each directory will have a file named ''gerund_count'' containing the output of the ''gerund_count'' command.
=== Using an Index File ===
The partitioning scheme can be as complex as the user desires. If the directories were not named "1" through "100" but instead used the name of the novel contained within, an index file could be created containing the directory names, one per line:
Great_Expectations
Atlas_Shrugged
The_Great_Gatsby
:
The job submission might then look like:
[(it_css:traine)@farber novels]$ qsub -N gerunds -o gerund_count -t 1-100
#
# Count gerunds in the file:
#
NOVEL_FOR_TASK=`sed -n ${SGE_TASK_ID}p index.txt`
cd $NOVEL_FOR_TASK
../gerund_count novel.txt > gerund_count
^D
Your job-array 82718.1-100:1 ("gerunds") has been submitted
The ''sed'' command selects a single line of the ''index.txt'' file; for sub-task 1 the first line is selected, sub-task 2 the second line, etc.