Using Grid Engine on Mills

The Grid Engine job scheduling system is used to manage and control the computing resources for all jobs submitted to a cluster. This includes load balancing, reconciling requests for memory and processor cores with availability of those resources, suspending and restarting jobs, and managing jobs with different priorities. Grid Engine is also known as Oracle/Sun Grid Engine or SGE.

Grid Engine job scheduling system provides an excellent overview of Grid Engine, the job scheduling system used on Mills.

In order to schedule any job (interactive or batch) on a cluster, you must first set your workgroup to define your cluster group (investing-entity) and gain access to its compute nodes.
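
For example, assuming a hypothetical investing-entity group named it_css, you would type:

workgroup -g it_css

This starts a new shell with your workgroup set; qsub and qlogin commands issued from that shell can then target your group's owner queues.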

See Scheduling Jobs and Managing Jobs for general information about getting started with scheduling and managing jobs on a cluster using Grid Engine.

Each investing-entity on a cluster has four owner queues that exclusively use the investing-entity's compute nodes. (They do not use any nodes belonging to others.) Grid Engine allows those queues to be selected only by members of the investing-entity's group.

There are also queues that span nodes cluster-wide: standby, standby-4h, spillover-24core, spillover-48core and idle. Through these, Grid Engine allows users to run jobs on nodes belonging to other investing-entities. (The idle queue is currently disabled.)

When submitting a batch job to Grid Engine, you specify the resources you need or want for your job. You don't actually specify the name of the queue. Instead, you include a set of directives that specify your job's characteristics. Grid Engine then chooses the most appropriate queue that meets those needs.

The queue to which a job is assigned depends primarily on six factors:

  • Whether the job is serial or parallel
  • Which parallel environment (e.g., openmpi, threads) is needed
  • Which or how much of a resource is needed (e.g., max clock time, max memory)
  • Whether the job can be suspended and restarted by the system
  • Whether the job is non-interactive or interactive
  • Whether you want to use idle nodes belonging to others
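
For example, a batch script might begin with directives addressing several of these factors. The following is a minimal sketch only; the job name, slot count, and memory values are placeholders, not recommendations:

#!/bin/bash
#
# Grid Engine reads lines beginning with #$ as qsub options.
# Job name (hypothetical):
#$ -N myjob
# Parallel environment and slot count:
#$ -pe threads 4
# Memory resources (see the resource discussion below):
#$ -l mem_free=2G,ram_free=2G

echo "Running on $HOSTNAME with $NSLOTS slots"

Submitting this script with qsub lets Grid Engine weigh those directives and choose the appropriate queue.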

For each investing-entity, the owner-queue names start with the investing-entity's name:

«investing_entity».q+: The default queue for non-interactive serial or parallel jobs, and the primary queue for long-running jobs. These jobs must be able to be suspended and restarted by Grid Engine, and they can be preempted by jobs submitted to the investing-entity's development queue. Examples: all serial (single-core) jobs, Open MPI jobs, and openMP or other jobs using the threads parallel environment.
«investing_entity».q: A special queue for non-suspendable parallel jobs, such as MPICH. These jobs will not be preempted by others' job submissions.
«investing_entity»-qrsh.q: A special queue for interactive jobs only. Jobs are scheduled to this queue when you use Grid Engine's qlogin command.
standby.q: A special queue that spans all nodes, with at most 240 slots per user. Submissions have a lower priority than jobs submitted to owner queues, and standby jobs are only started on lightly loaded nodes. These jobs will not be preempted by others' job submissions. Jobs are terminated with notification after running for 8 hours of elapsed (wall-clock) time. Also see the standby-4h.q entry.
You must specify -l standby=1 as a qsub option. You must also use the -notify option if your job traps the USR2 termination signal; see the example after this list.
standby-4h.q: A special queue that spans all nodes, with at most 816 slots per user. Submissions have a lower priority than jobs submitted to owner queues, and standby jobs are only started on lightly loaded nodes. These jobs will not be preempted by others' job submissions. Jobs are terminated with notification after running for 4 hours of elapsed (wall-clock) time.
You must specify -l standby=1 as a qsub option. If more than 240 slots are requested, you must also specify a maximum run time of 4 hours or less via the -l h_rt=hh:mm:ss option. Finally, use the -notify option if your job traps the USR2 termination signal; see the example after this list.
spillover-24core.q: A special queue that spans all standard nodes (24 cores). Grid Engine maps jobs to it when the requested resources are unavailable on standard nodes in the owner queues, e.g., after a node failure or when standby jobs are using owner resources. Implemented on February 29, 2016 according to the Mills End-of-Life Policy.
spillover-48core.q: A special queue that spans all 4-socket nodes (48 cores). Grid Engine maps jobs to it when the requested resources are unavailable on 48-core nodes in the owner queues, e.g., after a node failure or when standby jobs are using owner resources. Owners of only 48-core nodes will not spill over to standard nodes. Implemented on February 29, 2016 according to the Mills End-of-Life Policy.
spare.q: A special queue that spans all nodes kept in reserve as replacements for failed owner nodes. Temporary access to the spare nodes is granted by request; when granted, the spare nodes augment your owner nodes. Jobs on the spare nodes will not be preempted by others' job submissions, but may need to be killed by IT. The owner of a job running on a spare node will be notified by email two hours before IT kills the job.
Be considerate in your use of the development queue. It may preempt 'q+' jobs being run by other users in your group if those jobs' computational resources are needed.
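
As referenced in the standby entries above, a standby submission pairs the -l standby=1 option with -notify when the job script traps the termination signal. A minimal sketch, with a hypothetical script name and cleanup action:

qsub -l standby=1 -notify myjob.qs

and, inside myjob.qs:

#!/bin/bash
#$ -l standby=1
#$ -notify

# With -notify, Grid Engine sends SIGUSR2 shortly before
# terminating the job, leaving a window for cleanup.
trap 'echo "caught USR2, saving state"; exit 1' USR2

./my_long_computation    # placeholder for your real work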

You may give a resource request list in the form -l resource=value. A list of available resources with their associated valid value specifiers can be obtained by the command:

qconf -sc

Each named complex or shortcut can be a resource, and there can be multiple, comma-separated resource=value pairs. The valid values are determined by the type: for example, a MEMORY type value could be 5G (5 gigabytes), and a TIME type value could be 1:30:00 (1 hour, 30 minutes).
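
For example, a single submission can combine a MEMORY request with a TIME limit (the script name is a placeholder):

qsub -l mem_free=5G,h_rt=1:30:00 myjob.qs

This asks Grid Engine for a node with 5 gigabytes of memory free and imposes a hard run-time limit of 1 hour, 30 minutes.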

In a cluster as large as Mills, the two most important resources are cores (CPUs) and memory. The number of cores is called slots, which is listed as a "requestable" and "consumable" resource. Parallel jobs, by definition, use multiple cores; thus, the slots resource is handled by the parallel environment option -pe, and you do not need to put it in a resource list.

There are several complexes relating to memory, and you will mostly be concerned with how much is free. Memory resources come as both consumable and sensor-driven (not consumable) types. For example:

Memory resource   Consumable   Explanation
mem_free          No           Memory that must be available BEFORE the job can start
ram_free          Yes          Memory reserved for the job DURING execution

It is usually a good idea to add both resources. The mem_free complex is sensor-driven and is more reliable for choosing a node for your job. The ram_free complex is consumable, which means you are reserving the memory for future use; other jobs that request ram_free may be barred from starting on the node. If you are specifying memory resources for a parallel environment job, the requested memory is multiplied by the slot count.

When using the shared-memory parallel computing environment (-pe threads), divide the total memory needed by the number of slots. For example, to request 48G of shared memory for an 8-thread job, request 6G (6G per slot).
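
On the command line, that 48G, 8-thread case looks like this (the script name is a placeholder):

qsub -pe threads 8 -l mem_free=6G,ram_free=6G myjob.qs

Grid Engine multiplies the 6G per-slot request by the 8 slots, so 48G in total is reserved on the node.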

Example

Consider 30 serial jobs, each of which requires 20 Gbytes of memory. Use the command

qsub -l mem_free=20G,ram_free=20G -t 1-30 myjob.qs

This will submit 30 jobs to the queue, with the SGE_TASK_ID variable set for use in the myjob.qs script (an array job). The mem_free resource will cause Grid Engine to find a node (or wait until one is available) with 20 Gbytes of memory free. The ram_free resource will tell Grid Engine not to schedule too many jobs on the same node. Without the ram_free resource, an available node with all 64G free could be used to start 24 of the 30 jobs, one per core. That clearly will not work, since only three 20G jobs can run on one node with 64G of memory. With the ram_free resource, only 3 jobs will be started on the first available node. Some of your jobs may have to wait for earlier jobs to complete, but this is better than all jobs being memory starved.

The ram_free complex works best if everyone in your group uses it to schedule all jobs, but even if others in your group do not properly reserve memory with ram_free, you can use it to spread your large memory jobs to multiple nodes.
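
To make the array-job mechanics concrete, here is a minimal sketch of what myjob.qs might contain; the per-task input-file naming scheme and program name are hypothetical:

#!/bin/bash
#$ -l mem_free=20G,ram_free=20G

# SGE_TASK_ID takes a distinct value from 1 to 30 in each task.
echo "Task $SGE_TASK_ID starting on $HOSTNAME"
./my_serial_program input.$SGE_TASK_ID > output.$SGE_TASK_ID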

The /opt/templates/gridengine directory contains basic prototype job scripts for non-interactive parallel jobs. This section describes the -pe parallel environment option that is required for MPI jobs, openMP jobs, and other jobs that use the SMP (threads) programming model.

Type the command:

qconf -spl

to display a list of parallel environments available on a cluster.

The general form of the parallel environment option is:

-pe «parallel_environment» «Nproc»

where «Nproc» is the number of processor slots (cores) requested. Just use a single number, and not a range. Grid Engine tries to locate as many free slots as it can and assigns them to that batch job. The environment variable $NSLOTS is given that value.

The two most used parallel environments are threads and openmpi.

The threads parallel environment

Jobs such as those having openMP directives use the threads parallel environment, an implementation of the shared-memory programming model. These SMP jobs can only use the cores on a single node.

For example, if your group only owns nodes with 24 cores, then your -pe threads request may ask for at most 24 slots. Use Grid Engine's qconf command to determine the names and characteristics of the queues and compute nodes available to your investing-entity group on a cluster.

Threaded jobs do not necessarily complete faster when more slots are made available. Before running a series of production runs, you should experiment to determine how many slots generally perform best. Using that quantity will leave the remaining slots for others in your group to request. Remember: others can see how many slots you're using!

OpenMP jobs

For openMP jobs, add the following bash command to your job script:

export OMP_NUM_THREADS=$NSLOTS

IT provides a job script template called openmp.qs, available in /opt/templates/gridengine/openmp, to copy and customize for your OpenMP jobs.
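
A minimal sketch of such a script follows (a simplification, not a copy of IT's openmp.qs template; the slot count and program name are placeholders):

#!/bin/bash
#$ -pe threads 8

# Match the OpenMP thread count to the slots Grid Engine granted.
export OMP_NUM_THREADS=$NSLOTS
./my_openmp_program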

The openmpi parallel environment

MPI jobs inherently generate considerable network traffic among the processor cores of a cluster's compute nodes. The processors on the compute node may be connected by two types of networks: InfiniBand and Gigabit Ethernet.

IT has developed templates to help with the openmpi parallel environment on a given cluster, targeting different user needs and architectures. You can copy the templates from /opt/templates/gridengine/openmpi and customize them. The templates are essentially identical, differing only in the presence or absence of certain qsub options and in the values assigned to MPI_FLAGS based on particular environment variables. In all cases, the parallel environment option must be specified:

-pe openmpi «NPROC»

where «NPROC» is the number of processor slots (cores) requested. Use a single number, not a range. Grid Engine tries to locate as many free slots as it can and assigns them to that job. The environment variable $NSLOTS is given that value.

IT provides several job script templates in /opt/templates/gridengine/openmpi to copy and customize for your Open MPI jobs. See Open MPI on Mills for more details about these job scripts.
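
For orientation, here is a minimal sketch of an Open MPI batch script (a simplification, not one of IT's templates; the slot count and program name are placeholders):

#!/bin/bash
#$ -pe openmpi 48

# Launch one MPI process per slot granted by Grid Engine.
mpirun -np $NSLOTS ./my_mpi_program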

Using the exclusive access resource option -l exclusive=1 will block any other jobs from making use of resources on that host.

Using the standby resource option -l standby=1 will target the standby queues for your job.
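
For example, to run a hypothetical Open MPI job with exclusive access to its nodes:

qsub -l exclusive=1 -pe openmpi 24 myjob.qs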
