Running applications on Mills

The Grid Engine job scheduling system is used to manage and control the resources available to computational tasks. The job scheduler considers each job's resource requests (memory, disk space, processor cores) and executes it as those resources become available. The order in which jobs are submitted and a scheduling priority also dictate how soon the job will be eligible to execute. The job scheduler may suspend (and later restart) some jobs in order to more quickly complete jobs with higher scheduling priority.

Without a job scheduler, a cluster user would need to manually search for the resources required by his or her job, perhaps by randomly logging in to nodes and checking for other users' programs already executing there. The user would have to "sign out" the nodes he or she wishes to use in order to notify the other cluster users of resource availability¹⁾. A computer will perform this kind of chore more quickly and efficiently than a human can, and with far greater sophistication.

An outdated but still mostly relevant description of Grid Engine and job scheduling can be found in the first chapter of the Sun N1™ Grid Engine 6.1 User's Guide.

In this context, a job consists of:

a sequence of commands to be executed
a list of resource requirements and other properties affecting scheduling of the job
a set of environment variables

For an interactive job, the user manually types the sequence of commands once the job is eligible for execution. If the necessary resources for the job are not immediately available, then the user must wait; when resources are available, the user must be present at his/her computer in order to type the commands. Since the job scheduler does not care about the time of day, this could happen anytime, day or night.

By comparison, a batch job does not require the user be awake and at his or her computer: the sequence of commands is saved to a file, and that file is given to the job scheduler. A file containing a sequence of shell commands is also known as a script, so in order to run batch jobs a user must become familiar with shell scripting. The benefits of using batch jobs are significant:

a job script can be reused (versus repeatedly having to type the same sequence of commands for each job)
when resources are granted to the job it will execute immediately (day or night), yielding increased job throughput

An individual's increased job throughput is good for all users of the cluster!
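As a minimal sketch of what such a batch job script might look like (the file name myjob.qs, the job name, and the stand-in computation are illustrative assumptions, not taken from a Mills template):

```bash
#!/bin/bash
# Minimal batch job script sketch (hypothetical file name: myjob.qs).
# Lines starting with "#$" are read by Grid Engine as qsub options;
# bash itself treats them as ordinary comments.
#
# Job name (illustrative):
#$ -N example_job
# Run the job from the directory it was submitted from:
#$ -cwd
# Merge standard error into standard output:
#$ -j y

# The sequence of commands that makes up the job.
echo "Job started on ${HOSTNAME}"
result=$(( 6 * 7 ))    # stand-in for the real computation
echo "result=${result}"
echo "Job finished"
```

A script like this would typically be handed to the scheduler with qsub myjob.qs; the same file can be resubmitted as often as needed without retyping any commands.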

At its most basic, a queue represents a collection of computing entities (call them nodes) on which jobs can be executed. Each queue has properties that restrict what jobs are eligible to execute within it: a queue may not accept interactive jobs; a queue may place an upper limit on how long the job will be allowed to execute or how much memory it can use; or specific users may be granted or denied permission to execute jobs in a queue.

Grid Engine uses a cluster queue to embody the common set of properties that define the behavior of a queue. The cluster queue acts as a template for the queue instances that exist for each node that executes jobs for the queue. The term queue can refer to either of these, but in this documentation it will most often imply a cluster queue.

When submitting a job to Grid Engine, a user can explicitly specify which queue to use: doing so will place that queue's resource restrictions (e.g. maximum execution time, maximum memory) on the job, even if they are not appropriate. Usually it is easier if the user specifies what resources his or her job requires and lets Grid Engine choose an appropriate queue.

A job scheduling system is used to manage and control the computing resources for all jobs submitted to a cluster. This includes load balancing, limiting resources, reconciling requests for memory and processor cores with availability of those resources, suspending and restarting jobs, and managing jobs with different priorities.

Each investing-entity's group (workgroup) has owner queues that allow the use of a fixed number of slots, matching the total number of cores purchased. If a job is submitted that would use more than the slots allowed, the job will wait until enough slots are made available by completed jobs. There is no time limit imposed on owner queue jobs. All users can see running and waiting jobs, which allows groups to work out policies for managing purchased nodes.

The standby queues are available for projects requiring more slots than purchased, or to take advantage of idle nodes when a job would have to wait in the owner queue. Other workgroup nodes will be used, so standby jobs have a time limit, and users are limited to a total number of cores for all of their standby jobs. Generally, users can use 10 nodes for an 8 hour standby job or 40 nodes for a 4 hour standby job.

A spillover queue may be available for the case where a job is submitted to the owner queue, and there are standby jobs consuming needed slots. Instead of waiting, the jobs will be sent to the spillover queue to start on a similar idle node.

The Grid Engine job scheduling system is used to manage and control the computing resources for all jobs submitted to a cluster. This includes load balancing, reconciling requests for memory and processor cores with availability of those resources, suspending and restarting jobs, and managing jobs with different priorities. Grid Engine on Farber is Univa Grid Engine but still referred to as SGE.

In order to schedule any job (interactively or batch) on a cluster, you must set your workgroup to define your cluster group or investing-entity compute nodes.

See Scheduling Jobs and Managing Jobs for general information about getting started with scheduling and managing jobs on a cluster using Grid Engine.

Generally, your runtime environment (path, environment variables, etc.) should be the same as your compile-time environment. Usually, the best way to achieve this is to put the relevant VALET commands in shell scripts. You can reuse common sets of commands by storing them in a shell script file that can be sourced from within other shell script files.

If you are writing an executable script that does not have the -l option on the bash command, and you want to include VALET commands in your script, then you should include the line:

source /etc/profile.d/valet.sh

You do not need this command when you:

type the commands interactively or source the command file, or
include the lines in a file that is submitted with qsub.
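A hedged sketch of an executable script that initializes VALET itself is shown here. The vpkg_require call and the package id openmpi are assumptions for illustration; substitute the VALET commands your work actually needs. The guard around the source line is also an addition so that the script still runs on a machine without VALET.

```bash
#!/bin/bash
# Executable script started without "bash -l", so VALET is not set up yet.
valet_init=/etc/profile.d/valet.sh

# Load the VALET shell functions if the init file is present.
if [ -r "${valet_init}" ]; then
    source "${valet_init}"
    valet_loaded=yes
else
    valet_loaded=no
fi

# Hypothetical package id; replace with the package your program was built with.
if [ "${valet_loaded}" = yes ]; then
    vpkg_require openmpi
fi

echo "VALET loaded: ${valet_loaded}"
```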

Grid Engine includes man pages for all of the commands that will be reviewed in this document. When logged in to a cluster, type

[traine@mills ~]$ man qstat

to learn more about a Grid Engine command (in this case, qstat). Most commands will also respond to the -help command-line option to provide a succinct usage summary:

[traine@mills ~]$ qstat -help
usage: qstat [options]
        [-cb]                             view additional binding specific parameters
        [-ext]                            view additional attributes
           :

This section uses the wiki's documentation conventions.

You may give a resource request list in the form -l resource=value. A list of available resources with their associated valid value specifiers can be obtained by the command:

qconf -sc

Each named complex or shortcut can be a resource. There can be multiple, comma-separated resource=value pairs. The valid values are determined by the type. For example, a MEMORY type value could be 5G (5 gigabytes), and a TIME type value could be 1:30:00 (1 hour 30 minutes).
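The following sketch assembles a comma-separated resource list and prints the resulting qsub command for review rather than submitting it. The script name myjob.qs is illustrative, and using h_rt (the standard Grid Engine run-time limit, a TIME type) is an assumption here; check the output of qconf -sc to confirm which complexes are requestable on the cluster.

```bash
#!/bin/bash
# resource=value pairs to request (values follow the MEMORY and TIME formats).
resources=("mem_free=5G" "h_rt=1:30:00")

# Join the pairs with commas to form the argument of -l.
resource_list=$(IFS=,; echo "${resources[*]}")

# Print the command instead of running it, so it can be checked first.
echo "qsub -l ${resource_list} myjob.qs"
```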

In a cluster as large as Mills, the two most important resources are cores (CPUs) and memory. The number of cores is called slots. It is listed as a "requestable" and "consumable" resource. Parallel jobs, by definition, will use multiple cores. Thus, the slots resource is handled by the parallel environment option -pe, and you do not need to put it in a resource list.

There are several complexes relating to memory and you will be concerned about how much is free. Memory resources come as both consumable and sensor driven (not consumable). For example:

memory resource	Consumable	Explanation
mem_free	No	Memory that must be available BEFORE job can start
ram_free	Yes	Memory reserved for the job DURING execution

It is usually a good idea to request both resources. The mem_free complex is sensor driven, and is more reliable for choosing a node for your job. The ram_free complex is consumable, which means you are reserving the memory for the duration of your job. Other jobs that request ram_free may be barred from starting on the node. If you are specifying memory resources for a parallel environment job, the requested memory is multiplied by the slot count.

When using a shared memory parallel computing environment -pe threads, divide the total memory needed by the number of slots. For example, to request 48G of shared memory for an 8 thread job, request 6G (6G per slot).
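The per-slot arithmetic can be sketched as follows. The variable names and the script name myjob.qs are illustrative, and rounding up to whole gigabytes is an assumption made so that the total request is never smaller than what the job needs. The command is printed rather than submitted.

```bash
#!/bin/bash
total_mem_gb=48   # total shared memory the job needs, in gigabytes
slots=8           # number of threads (slots) requested with -pe threads

# Divide by the slot count, rounding up to a whole gigabyte.
per_slot_gb=$(( (total_mem_gb + slots - 1) / slots ))

echo "qsub -pe threads ${slots} -l mem_free=${per_slot_gb}G,ram_free=${per_slot_gb}G myjob.qs"
```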

Example

Consider 30 serial jobs, which each require 20 Gbytes of memory. Use the command

qsub -l mem_free=20G,ram_free=20G -t 1-30 myjob.qs

This will submit 30 jobs to the queue, with the SGE_TASK_ID variable set for use in the myjob.qs script (an array job). The mem_free resource will cause Grid Engine to find a node (or wait until one is available) with 20 Gbytes free. The ram_free resource will tell Grid Engine not to schedule too many jobs on the same node. Without the ram_free resource, an available node that has all 64G free will be used to start 24 of the 30 jobs. This will clearly not work, since only three 20G jobs can run on one node with 64G of memory. With the ram_free resource, 3 jobs will be started on the first available node. Some of your jobs may have to wait for earlier jobs to complete, but this is better than all jobs being memory starved.
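The arithmetic behind this example can be checked with a few lines of shell. The variable names are illustrative, and the node count is simply what would be needed to run all 30 tasks at once, not a statement about how Grid Engine will actually place them.

```bash
#!/bin/bash
node_mem_gb=64    # memory on one node, from the example
job_mem_gb=20     # memory needed by each serial job
total_jobs=30     # number of tasks in the array job

# Whole jobs that fit in one node's memory (integer division).
jobs_per_node=$(( node_mem_gb / job_mem_gb ))

# Nodes needed to run every task at the same time, rounded up.
nodes_needed=$(( (total_jobs + jobs_per_node - 1) / jobs_per_node ))

echo "jobs per node: ${jobs_per_node}"
echo "nodes needed to run all tasks at once: ${nodes_needed}"
```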

The ram_free complex works best if everyone in your group uses it to schedule all jobs, but even if others in your group do not properly reserve memory with ram_free, you can use it to spread your large memory jobs to multiple nodes.
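A sketch of how an array job script such as myjob.qs might use SGE_TASK_ID is given here. The input file naming scheme is hypothetical, and the fallback value of 1 is an addition so the script can be tried outside the scheduler. The options from the qsub command line are embedded in the script as "#$" lines instead.

```bash
#!/bin/bash
#$ -cwd
#$ -l mem_free=20G,ram_free=20G
#$ -t 1-30

# Grid Engine sets SGE_TASK_ID to a different value (1..30) for each task.
# Default to 1 when the script is run by hand outside the scheduler.
task_id=${SGE_TASK_ID:-1}

# Hypothetical naming scheme: one input file per task.
input_file="input_${task_id}.dat"

echo "Task ${task_id} would process ${input_file}"
```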

The /opt/shared/templates/gridengine directory contains basic prototype job scripts for non-interactive parallel jobs. This section describes the -pe parallel environment option that is required for MPI jobs, OpenMP jobs and other jobs that use the SMP (threads) programming model.

Type the command:

qconf -spl

to display a list of parallel environments available on a cluster.

The general form of the parallel environment option is:

-pe «parallel_environment» «Nproc»

where «Nproc» is the number of processor slots (cores) requested. Just use a single number, and not a range. Grid Engine tries to locate as many free slots as it can and assigns them to that batch job. The environment variable $NSLOTS is given that value.

The two most used parallel environments are threads and openmpi.

The threads parallel environment

Jobs such as those having OpenMP directives use the threads parallel environment, an implementation of the shared-memory programming model. These SMP jobs can only use the cores on a single node.

For example, if your group only owns nodes with 24 cores, then your -pe threads request may only ask for 24 or fewer slots. Use Grid Engine's qconf command to determine the names and characteristics of the queues and compute nodes available to your investing-entity group on a cluster.

Threaded jobs do not necessarily complete faster when more slots are made available. Before running a series of production runs, you should experiment to determine how many slots generally perform best. Using that quantity will leave the remaining slots for others in your group to request. Remember: others can see how many slots you're using!

OpenMP jobs

For OpenMP jobs, add the following bash command to your job script:

export OMP_NUM_THREADS=$NSLOTS

IT provides a job script template called openmp.qs available in /opt/shared/templates/gridengine/openmp to copy and customize for your OpenMP jobs.
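For orientation only, a threaded job script might be sketched as follows. This is not the contents of openmp.qs; the slot count of 4, the executable name, and the fallback value of 1 (used when NSLOTS is unset outside the scheduler) are assumptions.

```bash
#!/bin/bash
#$ -cwd
#$ -pe threads 4

# Grid Engine sets NSLOTS to the number of slots granted to the job.
# Fall back to 1 so the script also runs outside the scheduler.
export OMP_NUM_THREADS=${NSLOTS:-1}

echo "Running with OMP_NUM_THREADS=${OMP_NUM_THREADS}"

# Hypothetical OpenMP executable; uncomment and replace with your own program.
# ./my_openmp_program
```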

The openmpi parallel environment

MPI jobs inherently generate considerable network traffic among the processor cores of a cluster's compute nodes. The processors on the compute node may be connected by two types of networks: InfiniBand and Gigabit Ethernet.

IT has developed templates to help with the openmpi parallel environments for a given cluster, targeting different user needs and architecture. You can copy the templates from /opt/shared/templates/gridengine/openmpi and customize them. These templates are essentially identical with the exception of the presence or absence of certain qsub options and the values assigned to MPI_FLAGS based on using particular environment variables. In all cases, the parallel environment option must be specified:

-pe openmpi «NPROC»

where «NPROC» is the number of processor slots (cores) requested. Use a single number, not a range. Grid Engine tries to locate as many free slots as it can and assigns them to that job. The environment variable NSLOTS is given that value.

IT provides several job script templates in /opt/shared/templates/gridengine/openmpi to copy and customize for your Open MPI jobs. See Open MPI on Mills for more details about these job scripts.
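As a rough sketch of the overall shape of such a script, and not a substitute for the IT templates, the example below builds an mpirun command line from NSLOTS and prints it. The slot count of 48 and the executable name are assumptions, and the MPI_FLAGS handling done by the templates is deliberately left out.

```bash
#!/bin/bash
#$ -cwd
#$ -pe openmpi 48

# Grid Engine sets NSLOTS to the number of slots granted to the job.
# Fall back to 1 so the script also runs outside the scheduler.
nprocs=${NSLOTS:-1}

# Hypothetical MPI executable; the templates add further options via MPI_FLAGS.
mpi_command="mpirun -np ${nprocs} ./my_mpi_program"

# Printed for review here; a real job script would execute the command.
echo "Would run: ${mpi_command}"
```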

Using the exclusive access resource option -l exclusive=1 will block any other jobs from making use of resources on that host.

Using the standby resource option -l standby=1 will target the standby queues for your job.
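Either option can be given on the qsub command line or embedded in the job script. The sketch below embeds the standby request; the parallel environment and slot count are illustrative assumptions, and the exclusive option is shown only as a commented-out alternative.

```bash
#!/bin/bash
#$ -cwd
#$ -pe openmpi 48
# Target the standby queues for this job:
#$ -l standby=1
# To block other jobs from the hosts instead, request exclusive access
# by removing the leading "# " from the next line:
# #$ -l exclusive=1

# NSLOTS is set by Grid Engine; default to 1 outside the scheduler.
slots=${NSLOTS:-1}
echo "Job running with ${slots} slot(s)"
```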

¹⁾ Historically, this is actually how some clusters were managed!
